Intelligent Signatures using Embeddings and K-Nearest-Neighbors

November 19, 2020

At Abnormal Security, one of our key objectives is to build a detection engine that can continuously adapt to a changing attack landscape. To that end, we want our systems to adjust rapidly to recent, high-value messages, even when only a few examples are available. We already retrain our ML models frequently to catch general trends, but we would also like to supplement these models with so-called signatures.

In classic email security, a signature is usually an exact match on IP, domain, content, etc. These are used as a backup to ML models to help ensure that if the system does make a mistake, either a false positive or false negative, it will not make that same mistake again.

The challenge, though, is that exact matching on simple attributes of a message does not generalize well to similar messages. IP signatures work well in some circumstances, but attackers often rotate IPs or use botnets. To address this issue, we need “intelligent” signatures that perform more sophisticated fuzzy matching and automatically detect when two messages are alike.

But how can we implement this matching, if not with strictly defined rules? At Abnormal, we have many ML models that represent various dimensions of a message. For example:

NLP models on the content of the messages
Behavior models over the communication graph associated with a message
Link models representing URLs and link content

Our approach is to produce representations of a message along the dimensions already incorporated by our models. From there, we can classify messages using a simple but effective approach utilizing the k-nearest neighbors algorithm (KNN). Specifically, each message we process is represented as a compact embedding, obtained from the layers of a deep multimodal neural network. For any new incoming message, we can find the previous messages with the closest embeddings, as measured by cosine distance or a similar metric, and classify it as malicious or not accordingly.
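To make the nearest-neighbor step concrete, here is a minimal sketch using NumPy. The two-dimensional toy embeddings, the majority-vote rule, and the function names (`cosine_distances`, `knn_classify`) are illustrative assumptions, not our production implementation; real embeddings would come from the layers of the neural network and have many more dimensions.

```python
import numpy as np


def cosine_distances(query, index):
    """Cosine distance between one query vector and every row of `index`."""
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    return 1.0 - m @ q


def knn_classify(query, index, labels, k=3):
    """Classify `query` by majority vote over its k nearest stored embeddings.

    Majority vote is an assumed scoring rule chosen for simplicity.
    """
    dists = cosine_distances(query, index)
    nearest = np.argsort(dists)[:k]
    malicious_fraction = labels[nearest].mean()
    return bool(malicious_fraction > 0.5), nearest


# Toy index: three stored attack embeddings (label 1) and two safe ones (label 0).
index = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([1, 1, 1, 0, 0])

query = np.array([0.95, 0.05])  # new incoming message, close to the attacks
is_malicious, neighbors = knn_classify(query, index, labels, k=3)
print(is_malicious, sorted(neighbors.tolist()))
```

At scale, a brute-force scan like this would typically be replaced by an approximate nearest-neighbor index, but the classification logic stays the same.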

Figure: Visualization of the nearest neighbors approach underlying our intelligent signatures

We can repeat this process multiple times using various embeddings and combine the results for a signature score. This is, in effect, finding messages that are nearby in various spaces (for example, nearby in content, nearby in terms of the parties involved, etc.) and using that information to make a classification decision.
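As a rough illustration of combining results across spaces, the sketch below computes a per-space score (the fraction of malicious messages among the k nearest neighbors) and merges the scores with a weighted average. The weighted average, the space names `content` and `behavior`, and the function names are assumptions made for this example; the post does not specify the actual combination function.

```python
import numpy as np


def space_score(query, index, labels, k):
    """Fraction of malicious labels among the k nearest neighbors (cosine)."""
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    nearest = np.argsort(1.0 - m @ q)[:k]
    return float(labels[nearest].mean())


def signature_score(queries, indexes, labels, weights, k=2):
    """Combine per-space KNN scores with a weighted average (assumed scheme)."""
    scores = {
        name: space_score(queries[name], indexes[name], labels, k)
        for name in indexes
    }
    total_weight = sum(weights[name] for name in scores)
    combined = sum(weights[name] * scores[name] for name in scores) / total_weight
    return combined, scores


# The same four stored messages, embedded in two hypothetical spaces.
labels = np.array([1, 1, 0, 0])
indexes = {
    "content": np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]),
    "behavior": np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.0, 1.0]]),
}
queries = {
    "content": np.array([1.0, 0.05]),
    "behavior": np.array([1.0, 0.05]),
}
weights = {"content": 1.0, "behavior": 1.0}

combined, per_space = signature_score(queries, indexes, labels, weights)
print(combined, per_space)
```

In this toy case the message looks clearly malicious by content but ambiguous by behavior, so the combined score lands between the two.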

While this general approach is fairly straightforward, there are several practical constraints. For example, if we were to store embeddings of every single message we have seen, we would need an extremely large amount of memory, which would pose a challenge from a systems standpoint. Furthermore, the latency of finding the nearest neighbors would be too high to support real-time applications.

To work around these challenges, we only store embeddings of recent messages, with different sampling schemes used for attacks and non-attacks to reduce message volume in the nearest neighbor index. Additionally, we use experimentally validated scoring functions and thresholds to optimize classification. This is especially important when dealing with local neighborhoods of messages that may have high quantities of both safe and malicious messages.
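One way to picture a recency-limited, sampled index is sketched below. The class name `RecentEmbeddingIndex`, the time-window eviction, and the per-class keep rates are hypothetical choices for illustration; the actual sampling schemes and retention window are not described here. The sketch also assumes messages arrive in timestamp order.

```python
import random
from collections import deque

import numpy as np


class RecentEmbeddingIndex:
    """Keeps only recent embeddings, sampling attacks and non-attacks differently."""

    def __init__(self, max_age_seconds, attack_keep_rate, safe_keep_rate, seed=0):
        self.max_age_seconds = max_age_seconds
        self.keep_rate = {True: attack_keep_rate, False: safe_keep_rate}
        self.rng = random.Random(seed)
        # Entries are (timestamp, embedding, is_attack), oldest first.
        self.entries = deque()

    def evict(self, now):
        """Drop entries older than the retention window."""
        while self.entries and now - self.entries[0][0] > self.max_age_seconds:
            self.entries.popleft()

    def add(self, timestamp, embedding, is_attack):
        """Evict stale entries, then store the message with its class's keep rate."""
        self.evict(timestamp)
        if self.rng.random() < self.keep_rate[is_attack]:
            vector = np.asarray(embedding, dtype=float)
            self.entries.append((timestamp, vector, is_attack))
            return True
        return False

    def __len__(self):
        return len(self.entries)


# Example: keep every attack but only about 10% of safe messages for one hour.
index = RecentEmbeddingIndex(
    max_age_seconds=3600, attack_keep_rate=1.0, safe_keep_rate=0.1
)
index.add(0, [1.0, 0.0], is_attack=True)
index.add(60, [0.0, 1.0], is_attack=False)
print(len(index))
```

Keeping attacks at a higher rate than safe messages bounds memory while preserving the rare examples that matter most for a signature.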

A final point to note is that there is inevitably substantial overlap between this KNN approach and other models that may be used in production. The objective, namely learning optimal decision boundaries for classification, is the same in both cases! However, rapid updating of the nearest neighbors index is much more operationally feasible and more reliable than retraining all models multiple times a day. Furthermore, with sufficiently high thresholds, the nearest neighbors approach functions as a high-precision signature that has more degrees of freedom than hard-coded rules or heuristics.
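The high-precision behavior can be sketched as a decision rule that flags a message only when its close neighbors are overwhelmingly malicious, and otherwise abstains so that other models decide. The parameter names and default values (`max_distance`, `min_neighbors`, `threshold`) are illustrative assumptions, not experimentally validated settings.

```python
import numpy as np


def signature_decision(distances, labels, max_distance=0.1,
                       min_neighbors=3, threshold=0.9):
    """Return "malicious" only for dense, overwhelmingly malicious neighborhoods.

    distances: cosine distances from the new message to its nearest neighbors
    labels:    1 for a stored attack, 0 for a stored safe message
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels, dtype=float)

    close = distances <= max_distance
    if close.sum() < min_neighbors:
        return "abstain"  # too few close matches to act as a signature

    malicious_fraction = labels[close].mean()
    if malicious_fraction >= threshold:
        return "malicious"
    return "abstain"  # mixed neighborhood: defer to the other models


print(signature_decision([0.01, 0.02, 0.03, 0.5], [1, 1, 1, 0]))
print(signature_decision([0.01, 0.02, 0.03, 0.04], [1, 1, 0, 0]))
```

Raising `threshold` or lowering `max_distance` trades recall for precision, which is the behavior one would want from a signature layered on top of existing models.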

As a result, our KNN detector works in conjunction with existing models to provide an extra layer of defense in our detection engine, thereby bolstering Abnormal’s adaptiveness and speed of response in protecting our customers from advanced attacks. Additionally, further experimental work on incorporating communication graph patterns alongside text and neural net embeddings (and handling associated challenges with normalization and scoring in this multi-dimensional space) could potentially allow us to make this approach even more effective.

To learn more about the exciting ML work happening at Abnormal, check out our Machine Learning at Abnormal page.
