Intelligent Signatures using Embeddings and K-Nearest-Neighbors

November 19, 2020

At Abnormal Security, one of our key objectives is to build a detection engine that can continuously adapt to a changing attack landscape. As such, we want to ensure that our systems can rapidly adjust to recent and high-value messages, even when only a few examples are available. We already retrain our ML models frequently to catch general trends, but we would also like to supplement these models with so-called signatures.

In classic email security, a signature is usually an exact match on IP, domain, content, etc. These are used as a backup to ML models to help ensure that if the system does make a mistake, either a false positive or false negative, it will not make that same mistake again.

The challenge here, though, is that exact matching on simple attributes of a message generalizes poorly to similar messages. IP signatures work well in some circumstances, but attackers often rotate IPs or use botnets. To address this issue, we need “intelligent” signatures that perform a more sophisticated, fuzzy form of matching and automatically detect when two messages are alike.

But how can we implement this matching, if not with strictly defined rules? At Abnormal, we have many ML models that represent various dimensions of a message. For example:

  1. NLP models on the content of the messages
  2. Behavior models over the communication graph associated with a message
  3. Link models representing URLs and link content

Our approach is to produce representations of a message along the dimensions already incorporated by our models. From there, we can classify messages using a simple but effective approach utilizing the k-nearest neighbors algorithm (KNN). Specifically, each message we process is represented as a compact embedding, obtained from the layers of a deep multimodal neural network. For any new incoming message, we can find the previous messages with the closest embeddings, as measured by cosine distance or a similar metric, and classify it as malicious or not accordingly.
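The nearest-neighbor lookup described above can be sketched in a few lines. This is a minimal illustration, not our production code: the three-dimensional toy embeddings, the names `cosine_distance` and `knn_attack_fraction`, the choice of k = 3, and the 0.67 decision threshold are all assumptions made for the example, and a real system would use an approximate nearest neighbor index rather than a full sort.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def knn_attack_fraction(query: Vector,
                        index: List[Tuple[Vector, bool]],
                        k: int = 3) -> float:
    """Fraction of the k nearest stored messages that were labeled attacks."""
    if not index:
        return 0.0
    # Brute-force search: sort every stored embedding by distance to the query.
    nearest = sorted(index, key=lambda item: cosine_distance(query, item[0]))[:k]
    return sum(1 for _, is_attack in nearest if is_attack) / len(nearest)


# Hypothetical index of (embedding, is_attack) pairs for previously seen messages.
index = [
    ([1.0, 0.1, 0.0], True),
    ([0.9, 0.2, 0.0], True),
    ([0.8, 0.0, 0.1], True),
    ([0.0, 1.0, 0.2], False),
    ([0.1, 0.9, 0.0], False),
]

score = knn_attack_fraction([0.95, 0.1, 0.05], index, k=3)
is_malicious = score >= 0.67  # illustrative threshold, not a tuned value
```

Here the query lands among the three stored attacks, so all of its nearest neighbors are malicious and the message is flagged.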

Figure: Visualization of the nearest neighbors approach underlying our intelligent signatures

We can repeat this process multiple times using various embeddings and combine the results for a signature score. This is, in effect, finding messages that are nearby in various spaces (for example, nearby in content, nearby in terms of the parties involved, etc.) and using that information to make a classification decision.
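One simple way to combine results from several embedding spaces is a weighted average of the per-space KNN scores. The sketch below assumes each space has already produced a score in [0, 1]; the space names, the weights, and the function `combine_signature_scores` are hypothetical, and the combination we actually use may differ.

```python
from typing import Dict


def combine_signature_scores(space_scores: Dict[str, float],
                             weights: Dict[str, float]) -> float:
    """Weighted average of per-space KNN scores, each assumed to be in [0, 1]."""
    total_weight = sum(weights[name] for name in space_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(score * weights[name] for name, score in space_scores.items())
    return weighted_sum / total_weight


# Hypothetical per-space scores for one incoming message.
space_scores = {"content": 1.0, "behavior": 0.5, "links": 0.0}
# Hypothetical weights expressing how much each space is trusted.
weights = {"content": 2.0, "behavior": 1.0, "links": 1.0}

signature_score = combine_signature_scores(space_scores, weights)
```

With these inputs the message looks exactly like past attacks in content, is ambiguous in behavior, and looks benign in its links, giving a combined score of 0.625.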

Figure: Diagram of the broader system design

While this general approach is fairly straightforward, there are several practical constraints. For example, if we were to store embeddings of every single message we have seen, we would need an extremely large amount of memory, which would pose a challenge from a systems standpoint. Furthermore, the latency of finding the nearest neighbors would be too high to support real-time applications.

To work around this challenge, we only store embeddings of recent messages, with different sampling schemes used for attacks and non-attacks to reduce message volume in the nearest neighbor index. Additionally, we use experimentally validated scoring functions and thresholds to optimize classification. This is especially important when dealing with local neighborhoods of messages that may have high quantities of both safe and malicious messages.
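The idea of a recent-only index with class-dependent sampling can be sketched as follows. Everything specific here is an assumption for illustration: the class name `RecentMessageIndex`, the one-hour window, the keep rates, and the requirement that messages arrive in timestamp order.

```python
import random
from collections import deque
from typing import Deque, List, Tuple


class RecentMessageIndex:
    """Keeps sampled embeddings of recent messages only (illustrative sketch)."""

    def __init__(self, max_age_seconds: float,
                 attack_keep_rate: float = 1.0,
                 safe_keep_rate: float = 0.05,
                 seed: int = 0) -> None:
        self.max_age_seconds = max_age_seconds
        # Different sampling rates for attacks (True) and non-attacks (False).
        self.keep_rate = {True: attack_keep_rate, False: safe_keep_rate}
        self._rng = random.Random(seed)
        self._entries: Deque[Tuple[float, List[float], bool]] = deque()

    def evict(self, now: float) -> None:
        """Drop entries older than the window; assumes timestamps arrive in order."""
        cutoff = now - self.max_age_seconds
        while self._entries and self._entries[0][0] < cutoff:
            self._entries.popleft()

    def add(self, timestamp: float, embedding: List[float], is_attack: bool) -> bool:
        """Evict stale entries, then store the message with its class's keep rate."""
        self.evict(timestamp)
        if self._rng.random() >= self.keep_rate[is_attack]:
            return False  # sampled out to keep the index small
        self._entries.append((timestamp, embedding, is_attack))
        return True

    def __len__(self) -> int:
        return len(self._entries)


# Extreme keep rates make the example deterministic: keep every attack, no safe mail.
recent = RecentMessageIndex(max_age_seconds=3600, attack_keep_rate=1.0, safe_keep_rate=0.0)
kept_attack = recent.add(0.0, [1.0, 0.0], is_attack=True)
kept_safe = recent.add(10.0, [0.0, 1.0], is_attack=False)
recent.add(4000.0, [0.9, 0.1], is_attack=True)  # evicts the entry stored at t=0
```

In practice the keep rate for non-attacks would sit somewhere between these extremes, trading index size against coverage of benign neighborhoods.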

A final point to note is that there is inevitably substantial overlap between this KNN approach and other models that may be used in production. The objective, namely learning optimal decision boundaries for classification, is the same in both cases! However, rapid updating of the nearest neighbors index is much more operationally feasible and more reliable than retraining all models multiple times a day. Furthermore, with sufficiently high thresholds, the nearest neighbors approach functions as a high-precision signature that has more degrees of freedom than hard-coded rules or heuristics.
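To make the high-precision behavior concrete, the sketch below fires a signature only when enough neighbors are very close and nearly all of them are attacks. The function name `signature_fires` and the three threshold values are hypothetical, chosen for illustration rather than taken from our experiments.

```python
from typing import List, Tuple


def signature_fires(neighbors: List[Tuple[float, bool]],
                    max_distance: float = 0.1,
                    min_neighbors: int = 3,
                    min_attack_fraction: float = 0.9) -> bool:
    """Fire only on a dense, almost purely malicious local neighborhood.

    `neighbors` holds (cosine_distance, is_attack) pairs for the nearest
    stored messages. All thresholds are illustrative assumptions.
    """
    # Ignore neighbors that are too far away to count as "the same" message.
    close_labels = [is_attack for distance, is_attack in neighbors
                    if distance <= max_distance]
    if len(close_labels) < min_neighbors:
        return False  # not enough evidence nearby
    return sum(close_labels) / len(close_labels) >= min_attack_fraction


dense_attack_cluster = [(0.01, True), (0.02, True), (0.05, True)]
mixed_neighborhood = [(0.01, True), (0.02, False), (0.05, True)]
distant_attacks = [(0.40, True), (0.50, True), (0.60, True)]

fires_on_cluster = signature_fires(dense_attack_cluster)
fires_on_mixed = signature_fires(mixed_neighborhood)
fires_on_distant = signature_fires(distant_attacks)
```

Only the dense attack cluster triggers the signature. The mixed neighborhood and the distant attacks are left to the other models, which is the precision-over-recall trade-off described above.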

As a result, our KNN detector works in conjunction with existing models to provide an extra layer of defense in our detection engine, improving how quickly Abnormal can adapt and respond when protecting our customers from advanced attacks. Further experimental work on incorporating communication graph patterns alongside text and neural net embeddings, along with handling the associated normalization and scoring challenges in this multi-dimensional space, could make this approach even more effective.

To learn more about the exciting ML work happening at Abnormal, check out our Machine Learning at Abnormal page.
