Intelligent Signatures using Embeddings and K-Nearest-Neighbors

November 19, 2020

At Abnormal Security, one of our key objectives is to build a detection engine that can continuously adapt to a changing attack landscape. As such, we want to ensure that our systems can rapidly adjust to recent and high-value messages—even with a low number of examples. We frequently retrain ML models to catch general trends, but we would also like to supplement these models with so-called signatures.

In classic email security, a signature is usually an exact match on IP, domain, content, etc. These are used as a backup to ML models to help ensure that if the system does make a mistake, either a false positive or false negative, it will not make that same mistake again.

The challenge, though, is that this exact-match approach on simple attributes of a message doesn’t generalize well to catching similar messages. IP signatures work in some circumstances, but attackers often switch IPs or use botnets. To address this, we need “intelligent” signatures that can perform more sophisticated fuzzy matching and automatically detect when two messages are alike.

But how can we implement this matching, if not with strictly defined rules? At Abnormal, we have many ML models that represent various dimensions of a message. For example:

  1. NLP models on the content of the messages
  2. Behavior models over the communication graph associated with a message
  3. Link models representing URLs and link content

Our approach is to produce representations of a message along the dimensions already incorporated by our models. From there, we can classify messages using a simple but effective approach utilizing the k-nearest neighbors algorithm (KNN). Specifically, each message we process is represented as a compact embedding, obtained from the layers of a deep multimodal neural network. For any new incoming message, we can find the previous messages with the closest embeddings, as measured by cosine distance or a similar metric, and classify it as malicious or not accordingly.

Figure: Visualization of the nearest neighbors approach underlying our intelligent signatures
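As a rough illustration of this lookup, here is a minimal sketch using scikit-learn's NearestNeighbors with cosine distance. The embedding dimensionality, index contents, neighbor count, and threshold below are hypothetical placeholders, not our production configuration.

```python
# Minimal sketch of nearest-neighbor classification over message embeddings.
# Assumes embeddings already come from an upstream neural network; the data,
# dimensions, and threshold are illustrative placeholders.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical index of previously seen messages: one embedding per message
# plus a label (1 = attack, 0 = safe).
stored_embeddings = np.random.rand(10_000, 128).astype(np.float32)
stored_labels = np.random.randint(0, 2, size=10_000)

index = NearestNeighbors(n_neighbors=5, metric="cosine")
index.fit(stored_embeddings)

def knn_signature_score(message_embedding):
    """Score a new message by the labels of its nearest stored neighbors."""
    distances, neighbor_ids = index.kneighbors(message_embedding.reshape(1, -1))
    neighbor_labels = stored_labels[neighbor_ids[0]]
    weights = 1.0 - distances[0]  # similarity = 1 - cosine distance
    return float(np.dot(weights, neighbor_labels) / max(weights.sum(), 1e-9))

new_message_embedding = np.random.rand(128).astype(np.float32)
is_flagged = knn_signature_score(new_message_embedding) > 0.8  # illustrative threshold
```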

We can repeat this process multiple times using various embeddings and combine the results for a signature score. This is, in effect, finding messages that are nearby in various spaces (for example, nearby in content, nearby in terms of the parties involved, etc.) and using that information to make a classification decision.

Figure: Diagram of the broader system design
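One simple way to combine the per-space scores, sketched below, is a weighted average. The space names and weights here are made up for illustration; in practice the scoring functions and weights are tuned experimentally, as discussed later.

```python
# Illustrative sketch: combine KNN scores from several embedding spaces
# (e.g., content, behavior, links) into one signature score.
# The weights are placeholders, not production values.
def combined_signature_score(scores_by_space, weights):
    """Weighted average of per-space nearest-neighbor scores."""
    total = sum(weights[name] for name in scores_by_space)
    return sum(scores_by_space[name] * weights[name]
               for name in scores_by_space) / total

scores = {"content": 0.92, "behavior": 0.75, "links": 0.40}
weights = {"content": 0.5, "behavior": 0.3, "links": 0.2}
print(combined_signature_score(scores, weights))  # 0.765
```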

While this general approach is fairly straightforward, there are several practical constraints. For example, if we were to store embeddings of every single message we have seen, we would need an extremely large amount of memory, which would pose a challenge from a systems standpoint. Furthermore, the latency of finding the nearest neighbors would be too high to support real-time applications.

To work around this challenge, we only store embeddings of recent messages, with different sampling schemes used for attacks and non-attacks to reduce message volume in the nearest neighbor index. Additionally, we use experimentally validated scoring functions and thresholds to optimize classification. This is especially important when dealing with local neighborhoods of messages that may have high quantities of both safe and malicious messages.
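A sketch of what such a sampling scheme might look like appears below. The retention window and sampling rates are hypothetical, chosen only to show the shape of the logic.

```python
# Illustrative sketch of down-sampling before indexing: keep only recent
# messages, and sample attacks and safe messages at different rates.
# The window and rates are hypothetical, not production settings.
import random
from datetime import datetime, timedelta, timezone

RETENTION_WINDOW = timedelta(days=30)  # hypothetical retention window
ATTACK_SAMPLE_RATE = 1.0               # keep essentially all attacks
SAFE_SAMPLE_RATE = 0.01                # keep a small fraction of safe mail

def should_index(message_timestamp, is_attack):
    """Decide whether a message's embedding enters the nearest-neighbor index."""
    if datetime.now(timezone.utc) - message_timestamp > RETENTION_WINDOW:
        return False
    rate = ATTACK_SAMPLE_RATE if is_attack else SAFE_SAMPLE_RATE
    return random.random() < rate
```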

A final point to note is that there is inevitably substantial overlap between this KNN approach and other models that may be used in production. The objective, namely learning optimal decision boundaries for classification, is the same in both cases! However, rapid updating of the nearest neighbors index is much more operationally feasible and more reliable than retraining all models multiple times a day. Furthermore, with sufficiently high thresholds, the nearest neighbors approach functions as a high-precision signature that has more degrees of freedom than hard-coded rules or heuristics.

As a result, our KNN detector works in conjunction with existing models to provide an extra layer of defense in our detection engine, thereby bolstering Abnormal’s adaptiveness and speed of response in protecting our customers from advanced attacks. Additionally, further experimental work on incorporating communication graph patterns alongside text and neural net embeddings (and handling associated challenges with normalization and scoring in this multi-dimensional space) could potentially allow us to make this approach even more effective.

To learn more about the exciting ML work happening at Abnormal, check out our Machine Learning at Abnormal page.
