Intelligent Signatures using Embeddings and K-Nearest-Neighbors

November 19, 2020

At Abnormal Security, one of our key objectives is to build a detection engine that can continuously adapt to a changing attack landscape. As such, we want to ensure that our systems can rapidly adjust to recent and high-value messages, even when only a few examples are available. We already retrain our ML models frequently to catch general trends, but we would also like to supplement these models with so-called signatures.

In classic email security, a signature is usually an exact match on IP, domain, content, etc. These are used as a backup to ML models to help ensure that if the system does make a mistake, either a false positive or false negative, it will not make that same mistake again.

The challenge, though, is that exact matching on simple attributes of a message does not generalize well to similar messages. IP signatures work well in some circumstances, but attackers often rotate IPs or use botnets. To address this issue, we need “intelligent” signatures that perform a more sophisticated fuzzy matching and automatically detect when two messages are alike.

But how can we implement this matching, if not with strictly defined rules? At Abnormal, we have many ML models that represent various dimensions of a message. For example:

  1. NLP models on the content of the messages
  2. Behavior models over the communication graph associated with a message
  3. Link models representing URLs and link content

Our approach is to produce representations of a message along the dimensions already incorporated by our models. From there, we can classify messages using a simple but effective approach utilizing the k-nearest neighbors algorithm (KNN). Specifically, each message we process is represented as a compact embedding, obtained from the layers of a deep multimodal neural network. For any new incoming message, we can find the previous messages with the closest embeddings, as measured by cosine distance or a similar metric, and classify it as malicious or not accordingly.

[Figure: Visualization of the nearest neighbors approach underlying our intelligent signatures]
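
As a rough illustration of the nearest neighbors idea described above, here is a minimal sketch in plain Python. It is not the production implementation: the function names, the toy two-dimensional embeddings, and the choice of k are all assumptions made for the example, and a real system would use an approximate nearest neighbor index rather than a full sort.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical index entry: (embedding, is_attack label of a past message).
LabeledEmbedding = Tuple[Sequence[float], bool]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def knn_attack_fraction(
    query: Sequence[float], index: List[LabeledEmbedding], k: int = 3
) -> float:
    """Fraction of the k closest stored messages that were labeled as attacks."""
    neighbors = sorted(index, key=lambda item: cosine_distance(query, item[0]))[:k]
    if not neighbors:
        return 0.0
    return sum(1 for _, is_attack in neighbors if is_attack) / len(neighbors)


# Toy index: two known attacks and three known safe messages.
index = [
    ([1.0, 0.0], True),
    ([0.9, 0.1], True),
    ([0.0, 1.0], False),
    ([0.1, 0.9], False),
    ([0.2, 0.8], False),
]

attack_like = knn_attack_fraction([0.95, 0.05], index, k=3)
safe_like = knn_attack_fraction([0.05, 0.95], index, k=3)
```

In this toy data, a query close to the two attack embeddings has two attacks among its three nearest neighbors, while a query close to the safe cluster has none.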

We can repeat this process multiple times using various embeddings and combine the results for a signature score. This is, in effect, finding messages that are nearby in various spaces (for example, nearby in content, nearby in terms of the parties involved, etc.) and using that information to make a classification decision.
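
One simple way to combine per-space results is a weighted average. The sketch below assumes that scheme purely for illustration; the post does not specify the actual combination function, and the space names and weights here are hypothetical.

```python
from typing import Dict


def signature_score(space_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-space nearest neighbor scores.

    space_scores maps an embedding space name to a score in [0, 1], for
    example the fraction of nearest neighbors in that space that were attacks.
    The weighted average is an assumed combination rule, not the author's.
    """
    total_weight = sum(weights[name] for name in space_scores)
    if total_weight == 0.0:
        return 0.0
    weighted_sum = sum(space_scores[name] * weights[name] for name in space_scores)
    return weighted_sum / total_weight


# Hypothetical per-space scores for one incoming message.
scores = {"content": 1.0, "behavior": 0.5, "links": 0.0}
# Hypothetical weights; in practice these would be tuned experimentally.
weights = {"content": 2.0, "behavior": 1.0, "links": 1.0}

combined = signature_score(scores, weights)  # (2.0 + 0.5 + 0.0) / 4.0 = 0.625
```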

[Figure: Diagram of the broader system design]

While this general approach is fairly straightforward, there are several practical constraints. For example, if we were to store embeddings of every single message we have seen, we would need an extremely large amount of memory, which would pose a challenge from a systems standpoint. Furthermore, the latency of finding the nearest neighbors would be too high to support real-time applications.

To work around this challenge, we only store embeddings of recent messages, with different sampling schemes used for attacks and non-attacks to reduce message volume in the nearest neighbor index. Additionally, we use experimentally validated scoring functions and thresholds to optimize classification. This is especially important when dealing with local neighborhoods of messages that may have high quantities of both safe and malicious messages.
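
To make the storage side concrete, here is a minimal sketch of a recency-bounded store with separate sampling rates for attacks and non-attacks. The class name, parameters, and sampling rates are assumptions for illustration only; the post does not describe the actual sampling schemes or retention window.

```python
import random
from collections import deque
from typing import Deque, List, Optional, Sequence, Tuple


class RecentEmbeddingIndex:
    """Keeps only recent, sampled embeddings (hypothetical sketch).

    Assumes messages are added in non-decreasing timestamp order, so the
    oldest entry is always at the left end of the deque.
    """

    def __init__(
        self,
        max_age_seconds: float,
        safe_sample_rate: float,
        attack_sample_rate: float = 1.0,
        seed: Optional[int] = None,
    ) -> None:
        self.max_age_seconds = max_age_seconds
        self.safe_sample_rate = safe_sample_rate
        self.attack_sample_rate = attack_sample_rate
        self._rng = random.Random(seed)
        self._entries: Deque[Tuple[float, List[float], bool]] = deque()

    def add(self, embedding: Sequence[float], is_attack: bool, timestamp: float) -> bool:
        """Store the embedding with a label-dependent probability."""
        rate = self.attack_sample_rate if is_attack else self.safe_sample_rate
        if self._rng.random() < rate:
            self._entries.append((timestamp, list(embedding), is_attack))
            return True
        return False

    def evict_expired(self, now: float) -> None:
        """Drop entries older than the retention window."""
        cutoff = now - self.max_age_seconds
        while self._entries and self._entries[0][0] < cutoff:
            self._entries.popleft()

    def __len__(self) -> int:
        return len(self._entries)


# Keep every attack, drop every safe message (extreme rates chosen for clarity).
recent = RecentEmbeddingIndex(max_age_seconds=100.0, safe_sample_rate=0.0)
kept_safe = recent.add([0.0, 1.0], is_attack=False, timestamp=0.0)
kept_attack = recent.add([1.0, 0.0], is_attack=True, timestamp=0.0)
recent.add([0.9, 0.1], is_attack=True, timestamp=100.0)
recent.evict_expired(now=150.0)  # removes the entry stored at timestamp 0.0
```

Sampling non-attacks more aggressively than attacks is one way to keep the index small while preserving the rare examples that matter most; the right rates would need to be validated experimentally.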

A final point to note is that there is inevitably substantial overlap between this KNN approach and other models that may be used in production. The objective, namely learning optimal decision boundaries for classification, is the same in both cases! However, rapid updating of the nearest neighbors index is much more operationally feasible and more reliable than retraining all models multiple times a day. Furthermore, with sufficiently high thresholds, the nearest neighbors approach functions as a high-precision signature that has more degrees of freedom than hard-coded rules or heuristics.
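
The high-precision behavior can be sketched as a decision rule that only fires when enough very close neighbors agree, and otherwise leaves the decision to the other models. The function name and all threshold values below are hypothetical; the post says only that thresholds are validated experimentally.

```python
from typing import List, Tuple


def signature_fires(
    neighbors: List[Tuple[float, bool]],
    max_distance: float = 0.1,
    min_neighbors: int = 3,
    min_attack_fraction: float = 0.9,
) -> bool:
    """Return True only when close neighbors overwhelmingly were attacks.

    neighbors holds (cosine_distance, is_attack) pairs for the retrieved
    nearest neighbors. A False result means "no signature match", not
    "safe": other models still make the final call in that case.
    """
    close_labels = [is_attack for distance, is_attack in neighbors if distance <= max_distance]
    if len(close_labels) < min_neighbors:
        return False  # not enough evidence nearby
    attack_fraction = sum(close_labels) / len(close_labels)
    return attack_fraction >= min_attack_fraction


# Three very close neighbors, all attacks: the signature fires.
unanimous = signature_fires([(0.01, True), (0.02, True), (0.05, True)])
# A mixed local neighborhood (2 of 3 attacks) stays below the 0.9 threshold.
mixed = signature_fires([(0.01, True), (0.02, False), (0.05, True)])
# Attack neighbors that are too far away do not count as a match.
distant = signature_fires([(0.3, True), (0.4, True), (0.5, True)])
```

Raising `min_attack_fraction` or lowering `max_distance` trades recall for precision, which is the regime in which a nearest neighbors detector behaves like a signature.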

As a result, our KNN detector works in conjunction with existing models to provide an extra layer of defense in our detection engine, thereby bolstering Abnormal’s adaptiveness and speed of response in protecting our customers from advanced attacks. Additionally, further experimental work on incorporating communication graph patterns alongside text and neural net embeddings (and handling associated challenges with normalization and scoring in this multi-dimensional space) could potentially allow us to make this approach even more effective.

To learn more about the exciting ML work happening at Abnormal, check out our Machine Learning at Abnormal page.
