Re-Scoring an ML Detection Engine on Past Attacks

March 30, 2021

Developing a machine learning product for cybersecurity comes with unique challenges. For a bit of background, Abnormal Security’s products prevent email attacks—think credential phishing, business email compromise, and malware—and also identify email accounts that have been taken over. These attacks are clever social engineering attempts launched to steal money (sometimes in the millions) or gain access to an organization for financial theft or espionage.

Detecting attacks is hard! We’re dealing with rare events—as low as 1 in 10 million messages or sign-ins. The data is high dimensional—all the free-form content in an email and linked or attached in an email. We require bleedingly high precision and recall. And, crucially, it is adversarial—attackers are constantly trying to outsmart us.

These factors have consequences for our ML system:

  1. The adversarial nature of the problem means the attack distribution is constantly shifting, so we must keep improving models. Attackers are often even using ML systems of their own to adapt their attacks!
  2. Due to the rarity of attacks, we must retain every precious sample. We cannot afford to throw out older attacks simply because they do not have the latest features. We must instead re-compute these features.
  3. Due to high dimensionality and volume, it is a big-data problem and requires efficient distributed data processing.

To build a platform and team that can operate and improve our detection engine at high velocity, we must enable ML engineers to experiment with changes across the entire stack. This includes changes to underlying detection code, new datasets, new features, and the development of new models.

Iteration for ML engineers working on detection problems at Abnormal. A lot goes into improving the detection stack, and it’s crucial to have a robust testing and evaluation suite that can evaluate across new models, datasets, new code, etc.

This loop is reminiscent of a software engineering CI/CD loop, but there are more moving pieces. When developing detectors, there may be new code involved, new datasets that must be served online and offline, and new models. We must test this entire stack thoroughly, and the easier it is to test, the easier it will be to safely iterate.

Why is testing the detection stack so important? Think about what could go wrong... if, for example, we make an unintentional code change that modifies a feature used by a model but we do not retrain the model, the effect could shift distributions, incorrectly classify, and miss a damaging attack. Since our system acts at incredibly high precision and recall, small changes can cascade to have large consequences.

Re-generating an Updated Labeled Dataset

Our rescoring system has three important components

  1. Golden Labels: The set of all labeled messages with features updated with the entire detection stem including recent code, joined data, and prediction models. We affectionately call this dataset “Golden Labels.” It’s the gold standard label set, and it is also precious because it contains all attack examples from the past. This dataset feeds into the following two processes.
  2. Rescoring: Process to evaluate the entire detection system to produce performance analytics like precision and recall. We call this “rescored recall.” For example, we measure what percentage of historical attacks we would catch today with the current detection system. Note: This is different than precision and recall evaluation on individual models on an evaluation set because we are testing an entire stack that includes multiple models, decision logic, feature extraction code, and datasets.
  3. Model Training: Like rescoring, re-training models requires up-to-date Golden Labels data. We must have correct and unbiased features to train our models so that the data observed when training matches up to the data at inference time. The features that feed into model training are organized as a DAG of dependencies where some features rely on other features or models.
Various steps in the model training and rescoring are dependent on the most recent datasets, code, models, and labeled samples. Since we don’t want to lose old attack samples, we must re-hydrate and re-extract features for all these old samples. This rescoring pipeline creates an updated dataset called “Golden Labels” used to ensure accurate data on the historical dataset.

Requirements for Data Golden Labels Re-Generation

For data that feeds into rescoring and model training to be effective, we have several requirements:

  1. (rescore requirement) - Should be able to update historical samples with the latest attribute extraction (1) code, (2) joined datasets, and (3) model predictions and evaluate proper and trustable analytics.
  2. (unbiased feature requirement) - Re-computed features on old samples accurately reflect how those features would appear if that sample were encountered today.
  3. (time-travel requirement) - To produce unbiased aggregate features, the system must perform time-travel: evaluate samples as they would have looked at the time of the attack. Time travel ensures unbiased data and also helps us avoid future leakage: avoiding bringing any data from the future (to a past sample) that leaks labels into training features.
  4. (developer effectiveness) - ML engineers must be able to easily develop, integrate, and deploy their changes to any area of the detection stack. Must be efficient enough to run frequently to retrain and evaluate changes and ad-hoc evaluations to answer “what-if” type questions.

Ad-Hoc Rescoring Experiments

In addition to the automatic daily re-generation of “Golden Labels” tagged to a particular code branch, we additionally have an Ad-Hoc rescoring pipeline allowing engineers to ask “what if” questions—that is, what happens to overall detection performance if we change one or more pieces of the system.

For example, we may not need to re-run the entire feature extraction stage if we are testing only a new model and the downstream impact of that model. To do so, we rely on the most recent updated Golden Labels from the night before and run additional steps:

Partial re-evaluation allows us to test changes on parts of the stack without re-extracting all features.

Example of an Ad-Hoc Rescoring Experiment

We can either set up two configurations, “Baseline” and “Experiment,” or run this on two different code branches. It’s up to the ML engineer to decide how to run their experiment correctly. Eventually, we would like our CI/CD system to run rescoring on stages affected by particular code changes and automatically provide metrics, but for now, it is manual.

This example configuration tests what happens when we swap out a single model.

# Baseline configuration runs model scoring and decisioning.
baseline_config = RescoreConfig(
 MODEL_SCORING, # Evaluates ML models
 DETECTION_DECISIONS, # Evaluates our detection decisions using the model scores.
)# Experimental setup to swap a single model.
experiment_config = RescoreConfig(

# Runs the rescoring and delivers analytics to the user.

We can use a similar system to generate model training data with experimental features.

Running Rescoring Efficiently

Both automatic and ad-hoc rescoring require a lot of heavy lifting behind the scenes. We run everything on Spark, and there are a lot of tricky data engineering problems to solve to satisfy the requirements listed above, especially the time travel problem. Read Part 2 of this Re-Scoring an ML Detection Engine of Past Attacks series for information on how we built this system.

And if you're interested in solving tough applied ML engineering problems in the cybersecurity space, we’re hiring!

Thanks to Justin Young, Carlos Gasperi, Kevin Lau, Dmitry Chechick, Micah Zirn, and everyone else on the detection team at Abnormal who contributed to this pipeline.

Blog black white arrows
Machine learning engineering is hard, especially when developing products at high velocity, as is the case for us at Abnormal Security. Typical software engineering lifecycles often fail when developing ML systems.
Read More
Blog green circle
You’ll find similar characteristics in BEC that you will in VEC. A common trait of BEC is it does not contain malware or malicious URLs, and due to that technique, it is able to bypass conventional email security measures like SEGs. BEC relies...
Read More

Related Posts

B 10 15 21
With Detection 360, submission to threat containment just got 94% faster, making it incredibly easy for customers to submit false positives or missed attacks, and get real-time updates from Abnormal on investigation, conclusion, and remediation.
Read More
Extortion blog cover
Unfortunately, physically threatening extortion attempts sent via email continue to impact companies and public institutions when received—disrupting business, intimidating employees, and occasioning costly responses from public safety.
Read More
Blog engineering cybersecurity careers
Cybersecurity Careers Awareness Week is a great opportunity to explore key careers in information security, particularly as there are an estimated 3.1 million unfilled cybersecurity jobs. This disparity means that cybercriminals are taking advantage of the situation, sending more targeted attacks and seeing greater success each year.
Read More
Blog hiring cybersecurity leaders
As with every equation, there are always two sides and while it can be easy to blame users when they fall victim to scams and attacks, we also need to examine how we build and staff security teams.
Read More
Cover automated ato
With an increase in threat actor attention toward compromising accounts, Abnormal is focused on protecting our customers from this potentially high-profile threat. We are pleased to announce that our new Automated Account Takeover (ATO) Remediation functionality is available.
Read More
Email spoofing cover
Email spoofing is a common form of phishing attack designed to make the recipient believe that the message originates from a trusted source. A spoofed email is more than just a nuisance—it’s a malicious communication that poses a significant security threat.
Read More
Cover cybersecurity month kickoff
It’s time to turn the page on the calendar, and we are finally in October—the one month of the year when the spooky becomes reality. October is a unique juncture in the year as most companies are making the mad dash to year-end...
Read More
Ices announcement cover
Abnormal ICES offers all-in-one email security, delivering a precise approach to combat the full spectrum of email-borne threats. Powered by behavioral AI technology and deeply integrated with Microsoft 365...
Read More
Account takeover cover
Account takeovers are one of the biggest threats facing organizations of all sizes. They happen when cybercriminals gain legitimate login credentials and then use those credentials to send more attacks, acting like the person...
Read More
Blog podcast green cover
Many companies aspire to be customer-centric, but few find a way to operationalize customer-centricity into their team’s culture. As a 3x SaaS startup founder, most recently at Orum, and a veteran of Facebook and Palantir, Ayush Sood...
Read More
Blog attack atlassian cover
Credential phishing links are most commonly sent by email, and they typically lead to a website that is designed to look like common applications—most notably Microsoft Office 365, Google, Amazon, or other well-known...
Read More
Blog podcast purple cover
Working at hyper-growth startups usually means that unreasonable expectations will be thrust on individuals and teams. Demanding timelines, goals, and expectations can lead to high pressure, stress, accountability, and ultimately, extraordinary growth and achievements.
Read More