
Intelligent Signatures using Embeddings and K-Nearest-Neighbors

November 19, 2020

At Abnormal Security, one of our key objectives is to build a detection engine that can continuously adapt to a changing attack landscape. As such, we want to ensure that our systems can rapidly adjust to recent and high-value messages—even with a low number of examples. We frequently retrain our ML models to capture general trends, but we would also like to supplement these models with so-called signatures.

In classic email security, a signature is usually an exact match on IP, domain, content, etc. These are used as a backup to ML models to help ensure that if the system does make a mistake, either a false positive or false negative, it will not make that same mistake again.

The challenge, though, is that this exact-match approach on simple attributes of a message doesn’t generalize well to catching similar messages. IP signatures work well in some circumstances, but attackers oftentimes switch up IPs or use botnets. To address this issue, we need “intelligent” signatures that can perform a more sophisticated fuzzy matching and automatically detect when two messages are alike.

But how can we implement this matching, if not with strictly defined rules? At Abnormal, we have many ML models that represent various dimensions of a message. For example:

  1. NLP models on the content of the messages
  2. Behavior models over the communication graph associated with a message
  3. Link models representing URLs and link content

Our approach is to produce representations of a message along the dimensions already incorporated by our models. From there, we can classify messages using a simple but effective approach utilizing the k-nearest neighbors algorithm (KNN). Specifically, each message we process is represented as a compact embedding, obtained from the layers of a deep multimodal neural network. For any new incoming message, we can find the previous messages with the closest embeddings, as measured by cosine distance or a similar metric, and classify it as malicious or not accordingly.
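To make this concrete, here is a minimal sketch of that kind of nearest-neighbor lookup: cosine distance over a store of labeled message embeddings, with a simple fraction-of-malicious-neighbors score. The function names, the choice of k, and the scoring rule are illustrative assumptions, not our production implementation.

```python
# Minimal sketch of a KNN lookup over message embeddings.
# Names, k, and the scoring rule are illustrative assumptions only.
import numpy as np

def cosine_distances(query: np.ndarray, index: np.ndarray) -> np.ndarray:
    """Cosine distance between one query embedding and every stored embedding."""
    query_norm = query / np.linalg.norm(query)
    index_norms = index / np.linalg.norm(index, axis=1, keepdims=True)
    return 1.0 - index_norms @ query_norm

def knn_signature_score(query_embedding: np.ndarray,
                        stored_embeddings: np.ndarray,
                        stored_labels: np.ndarray,
                        k: int = 10) -> float:
    """Fraction of the k nearest stored messages that were labeled malicious."""
    distances = cosine_distances(query_embedding, stored_embeddings)
    nearest = np.argsort(distances)[:k]
    return float(stored_labels[nearest].mean())

# Example: score a new message against previously seen, labeled embeddings.
rng = np.random.default_rng(0)
stored = rng.normal(size=(1000, 128))    # embeddings of past messages
labels = rng.integers(0, 2, size=1000)   # 1 = attack, 0 = safe
new_message = rng.normal(size=128)       # embedding of the incoming message
print(knn_signature_score(new_message, stored, labels, k=10))
```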

[Figure: Visualization of the nearest neighbors approach underlying our intelligent signatures]

We can repeat this process multiple times using various embeddings and combine the results into a single signature score. This is, in effect, finding messages that are nearby in various spaces (for example, nearby in content, nearby in terms of the parties involved, etc.) and using that information to make a classification decision.
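As a rough illustration, the per-space neighbor scores might be combined with a weighted average and compared against a threshold. The spaces, weights, and threshold below are placeholders rather than tuned production values.

```python
# Illustrative sketch of combining per-space KNN scores into one signature
# score. Weights and threshold are placeholder values, not tuned parameters.
EMBEDDING_SPACES = {"content": 0.5, "behavior": 0.3, "links": 0.2}

def combined_signature_score(per_space_scores: dict[str, float],
                             weights: dict[str, float] = EMBEDDING_SPACES) -> float:
    """Weighted average of the KNN scores computed in each embedding space."""
    total = sum(weights[space] * per_space_scores[space] for space in weights)
    return total / sum(weights.values())

# Example: near known attacks in content and link space, ordinary sender behavior.
scores = {"content": 0.9, "behavior": 0.2, "links": 0.8}
is_flagged = combined_signature_score(scores) > 0.7   # placeholder threshold
```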

[Figure: Diagram of the broader system design]

While this general approach is fairly straightforward, there are several practical constraints. For example, if we were to store embeddings of every single message we have seen, we would need an extremely large amount of memory, which would pose a challenge from a systems standpoint. Furthermore, the latency of finding the nearest neighbors would be too high to support real-time applications.

To work around this challenge, we only store embeddings of recent messages, with different sampling schemes used for attacks and non-attacks to reduce message volume in the nearest neighbor index. Additionally, we use experimentally validated scoring functions and thresholds to optimize classification. This is especially important when dealing with local neighborhoods of messages that may have high quantities of both safe and malicious messages.
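One way such a scheme could look is sketched below: a rolling index that retains only recent messages and downsamples safe messages more aggressively than attacks. The retention size, sampling rates, and data structure are hypothetical choices for illustration, not details of our production pipeline.

```python
# Hypothetical sketch of keeping the nearest-neighbor index small:
# retain only recent messages and downsample safe messages more heavily
# than attacks. All constants here are illustrative.
import random
from collections import deque

MAX_INDEX_SIZE = 100_000
ATTACK_SAMPLE_RATE = 1.0    # keep every attack example
SAFE_SAMPLE_RATE = 0.05     # keep a small fraction of safe messages

index = deque(maxlen=MAX_INDEX_SIZE)   # oldest entries fall off automatically

def maybe_add_to_index(embedding, is_attack: bool) -> None:
    """Add a message embedding to the rolling index, subject to sampling."""
    rate = ATTACK_SAMPLE_RATE if is_attack else SAFE_SAMPLE_RATE
    if random.random() < rate:
        index.append((embedding, is_attack))
```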

A final point to note is that there is inevitably substantial overlap between this KNN approach and other models that may be used in production. The objective, namely learning optimal decision boundaries for classification, is the same in both cases! However, rapid updating of the nearest neighbors index is much more operationally feasible and more reliable than retraining all models multiple times a day. Furthermore, with sufficiently high thresholds, the nearest neighbors approach functions as a high-precision signature that has more degrees of freedom than hard-coded rules or heuristics.

As a result, our KNN detector works in conjunction with existing models to provide an extra layer of defense in our detection engine, thereby bolstering Abnormal’s adaptiveness and speed of response in protecting our customers from advanced attacks. Additionally, further experimental work on incorporating communication graph patterns alongside text and neural net embeddings (and handling associated challenges with normalization and scoring in this multi-dimensional space) could potentially allow us to make this approach even more effective.

To learn more about the exciting ML work happening at Abnormal, check out our Machine Learning at Abnormal page.
