Intelligent Signatures using Embeddings and K-Nearest-Neighbors

November 19, 2020

At Abnormal Security, one of our key objectives is to build a detection engine that can continuously adapt to a changing attack landscape. As such, we want to ensure that our systems can rapidly adjust to recent and high-value messages, even when only a few examples are available. We frequently retrain our ML models to catch general trends, but we would also like to supplement these models with so-called signatures.

In classic email security, a signature is usually an exact match on IP, domain, content, etc. These are used as a backup to ML models to help ensure that if the system does make a mistake, either a false positive or false negative, it will not make that same mistake again.

The challenge, though, is that exact matching on simple message attributes does not generalize well to similar messages. IP signatures work well in some circumstances, but attackers often rotate IPs or use botnets. To address this issue, we need “intelligent” signatures that perform a more sophisticated fuzzy match and automatically detect when two messages are alike.

But how can we implement this matching, if not with strictly defined rules? At Abnormal, we have many ML models that represent various dimensions of a message. For example:

  1. NLP models on the content of the messages
  2. Behavior models over the communication graph associated with a message
  3. Link models representing URLs and link content

Our approach is to produce representations of a message along the dimensions already incorporated by our models. From there, we can classify messages using a simple but effective approach based on the k-nearest neighbors algorithm (KNN). Specifically, each message we process is represented as a compact embedding, obtained from the layers of a deep multimodal neural network. For any new incoming message, we can find the previous messages with the closest embeddings, as measured by cosine distance or a similar metric, and classify it as malicious or safe based on the labels of those neighbors.
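To make the nearest-neighbor step concrete, here is a minimal sketch in plain Python. It is not our production implementation: the function names, the toy 3-dimensional embeddings, the choice of k, and the simple majority vote are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def knn_classify(
    query: Sequence[float],
    index: List[Tuple[Sequence[float], bool]],
    k: int = 3,
) -> bool:
    """Label the query malicious if most of its k nearest stored messages are."""
    # Brute-force search: sort every stored embedding by distance to the query.
    neighbors = sorted(index, key=lambda item: cosine_distance(query, item[0]))[:k]
    malicious_votes = sum(1 for _, is_malicious in neighbors if is_malicious)
    return malicious_votes * 2 > len(neighbors)


# Hypothetical (embedding, is_malicious) pairs; real embeddings are much larger.
index = [
    ([0.9, 0.1, 0.0], True),
    ([0.8, 0.2, 0.1], True),
    ([0.7, 0.3, 0.0], True),
    ([0.0, 0.9, 0.4], False),
    ([0.1, 0.8, 0.5], False),
]

looks_malicious = knn_classify([0.85, 0.15, 0.05], index)
looks_safe = knn_classify([0.05, 0.85, 0.45], index)
```

The brute-force sort is only meant to show the logic. At realistic volumes, an approximate nearest-neighbor index would likely replace it.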

Visualization of the nearest neighbors approach underlying our intelligent signatures

We can repeat this process multiple times using various embeddings and combine the results for a signature score. This is, in effect, finding messages that are nearby in various spaces (for example, nearby in content, nearby in terms of the parties involved, etc.) and using that information to make a classification decision.
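One simple way to combine results across spaces is a weighted average of per-space scores. The sketch below assumes each space has already produced a score in [0, 1] (for example, the fraction of malicious messages among the nearest neighbors in that space). The space names, weights, and the averaging rule itself are hypothetical and stand in for whatever combination function is validated experimentally.

```python
from typing import Dict


def signature_score(space_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-space nearest-neighbor scores, in [0, 1]."""
    total_weight = sum(weights[name] for name in space_scores)
    if total_weight == 0:
        return 0.0  # no usable evidence from any space
    weighted_sum = sum(weights[name] * score for name, score in space_scores.items())
    return weighted_sum / total_weight


# Hypothetical per-space scores for one incoming message.
space_scores = {"content": 1.0, "behavior": 0.6, "links": 0.8}
# Hypothetical weights: here the content space counts twice as much.
weights = {"content": 2.0, "behavior": 1.0, "links": 1.0}

score = signature_score(space_scores, weights)
```

A message that is close to known attacks in several spaces at once ends up with a higher combined score than one that is close in only a single space.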

Diagram of the broader system design

While this general approach is fairly straightforward, there are several practical constraints. For example, if we were to store embeddings of every single message we have seen, we would need an extremely large amount of memory, which would pose a challenge from a systems standpoint. Furthermore, the latency of finding the nearest neighbors would be too high to support real-time applications.

To work around this challenge, we only store embeddings of recent messages, with different sampling schemes used for attacks and non-attacks to reduce message volume in the nearest neighbor index. Additionally, we use experimentally validated scoring functions and thresholds to optimize classification. This is especially important when dealing with local neighborhoods of messages that may have high quantities of both safe and malicious messages.
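The idea of keeping only recent messages, sampled at different rates per class, can be sketched as follows. The class name, the retention window, and the sampling rates are illustrative assumptions; the post does not state the actual values or which class is sampled more heavily.

```python
import random
from collections import deque
from typing import Deque, List, Optional, Tuple

Entry = Tuple[float, List[float], bool]  # (timestamp, embedding, is_malicious)


class RecentMessageIndex:
    """Hypothetical store of recent embeddings with per-class sampling."""

    def __init__(
        self,
        max_age_seconds: float,
        attack_rate: float,
        safe_rate: float,
        rng: Optional[random.Random] = None,
    ) -> None:
        self.max_age_seconds = max_age_seconds
        self.attack_rate = attack_rate  # probability of keeping an attack
        self.safe_rate = safe_rate      # probability of keeping a non-attack
        self.rng = rng or random.Random()
        self.entries: Deque[Entry] = deque()

    def evict(self, now: float) -> None:
        """Drop entries older than the window (assumes in-order timestamps)."""
        while self.entries and now - self.entries[0][0] > self.max_age_seconds:
            self.entries.popleft()

    def add(self, timestamp: float, embedding: List[float], is_malicious: bool) -> bool:
        """Sample the message into the index; return True if it was stored."""
        self.evict(timestamp)
        rate = self.attack_rate if is_malicious else self.safe_rate
        if self.rng.random() >= rate:
            return False  # sampled out to limit index size
        self.entries.append((timestamp, embedding, is_malicious))
        return True


# Illustrative settings only: keep one week, all attacks, 5% of safe messages.
index = RecentMessageIndex(max_age_seconds=7 * 24 * 3600, attack_rate=1.0, safe_rate=0.05)
index.add(timestamp=0.0, embedding=[0.9, 0.1], is_malicious=True)
```

Bounding the index by both age and sampling rate keeps memory use and lookup latency roughly proportional to recent traffic rather than to all traffic ever seen.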

A final point to note is that there is inevitably substantial overlap between this KNN approach and other models that may be used in production. The objective, namely learning optimal decision boundaries for classification, is the same in both cases! However, rapid updating of the nearest neighbors index is much more operationally feasible and more reliable than retraining all models multiple times a day. Furthermore, with sufficiently high thresholds, the nearest neighbors approach functions as a high-precision signature that has more degrees of freedom than hard-coded rules or heuristics.
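As a rough illustration of how a high threshold turns the neighbor vote into a high-precision signature, the sketch below fires only when enough very close neighbors exist and nearly all of their similarity-weighted mass is malicious. The function name, the similarity weighting, and every default value are assumptions for illustration, not the scoring function we use.

```python
from typing import List, Tuple


def signature_fires(
    neighbors: List[Tuple[float, bool]],
    min_similarity: float = 0.9,
    threshold: float = 0.95,
    min_neighbors: int = 3,
) -> bool:
    """Decide whether the signature fires.

    neighbors holds (cosine similarity, is_malicious) for each retrieved message.
    """
    # Ignore neighbors that are not close enough to count as a "match".
    close = [(sim, mal) for sim, mal in neighbors if sim >= min_similarity]
    if len(close) < min_neighbors:
        return False  # too little evidence: abstain and defer to other models
    total = sum(sim for sim, _ in close)
    if total <= 0:
        return False
    malicious = sum(sim for sim, mal in close if mal)
    # Mixed neighborhoods fall below the threshold and do not fire.
    return malicious / total >= threshold


uniform_attack = [(0.99, True), (0.97, True), (0.95, True)]
mixed = [(0.99, True), (0.97, False), (0.95, True)]

fires_on_uniform = signature_fires(uniform_attack)
fires_on_mixed = signature_fires(mixed)
```

With these settings the signature abstains in neighborhoods that mix safe and malicious messages, trading recall for precision and leaving those cases to the other models.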

As a result, our KNN detector works in conjunction with existing models to provide an extra layer of defense in our detection engine, thereby bolstering Abnormal’s adaptiveness and speed of response in protecting our customers from advanced attacks. Additionally, further experimental work on incorporating communication graph patterns alongside text and neural net embeddings (and handling associated challenges with normalization and scoring in this multi-dimensional space) could potentially allow us to make this approach even more effective.

To learn more about the exciting ML work happening at Abnormal, check out our Machine Learning at Abnormal page.
