Check out Abnormal Security's Head of Machine Learning Jeshua Bratman's latest Medium blog. Read the full post below:
Developing a machine learning product for cybersecurity comes with unique challenges. For a bit of background, Abnormal Security’s products prevent email attacks (think phishing, business email compromise, malware, etc.) and also identify accounts that have been taken over. These attacks are clever social engineering attempts launched to steal money (sometimes in the millions) or gain access to an organization for financial theft or espionage.
Detecting attacks is hard! We're dealing with rare events: as low as 1 in 10 million messages or sign-ins. The data is high dimensional: all the free-form content in an email, plus whatever is linked or attached to it. We require extremely high precision and recall. And, crucially, the problem is adversarial: attackers are constantly trying to outsmart us.
These factors have significant consequences for how we build our ML system.
To build a platform and team that can operate and improve our detection engine at high velocity, we must enable ML engineers to experiment with changes across the entire stack. This includes changes to underlying detection code, new datasets, new features, and the development of new models.
This loop is reminiscent of a software engineering CI/CD loop, but there are more moving pieces. When developing detectors, there may be new code involved, new datasets (that must be served online and offline), and new models. We must test this entire stack thoroughly, and the easier it is to test, the easier it is to iterate safely.
Why is testing the detection stack so important? Think about what could go wrong. If, for example, an unintentional code change modifies a feature used by a model, and we do not retrain that model, the feature distribution the model sees can shift, messages can be misclassified, and a damaging attack can slip through. Because our system operates at incredibly high precision and recall, small changes can cascade into large consequences.
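To make that risk concrete, here is a minimal sketch of the kind of feature-consistency regression check this motivates, run over a small labeled golden set before shipping a change. The names used here (the golden set structure, extract_features, TOLERANCE) are illustrative assumptions, not Abnormal's actual code.

# Hypothetical guardrail: recompute features for a small golden set and compare
# them to the values recorded when the current model was trained. All names
# (golden_set, extract_features, TOLERANCE) are illustrative, not a real API.
TOLERANCE = 1e-6

def check_feature_consistency(golden_set, extract_features):
    """Fail loudly if any recomputed feature drifts from its recorded value."""
    drifted = []
    for example in golden_set:
        recomputed = extract_features(example.raw_message)
        for name, recorded in example.recorded_features.items():
            new = recomputed.get(name)
            if new is None or abs(new - recorded) > TOLERANCE:
                drifted.append((example.message_id, name, recorded, new))
    if drifted:
        raise AssertionError(
            f"{len(drifted)} feature values drifted; first few: {drifted[:5]}"
        )

A check like this catches the silent case where a feature's definition changes but the model that consumes it does not.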
Our rescoring system has three important components:
For data that feeds into rescoring and model training to be effective, we have several requirements:
In addition to the automatic daily regeneration of "Golden Labels" (tagged to a particular code branch), we have an ad-hoc rescoring pipeline that lets engineers ask "what if" questions: what happens to overall detection performance if we change one or more pieces of the system?
For example, if we are testing only a new model (and its downstream impact), we may not need to re-run the entire feature extraction stage. Instead, we start from the most recent Golden Labels generated the night before and run only the additional steps we need:
Example of an ad-hoc rescoring experiment
We can either set up two configurations, “Baseline” and “Experiment,” or run this on two different code branches. It’s up to the ML engineer to decide how to run their experiment correctly. Eventually, we would like our CI/CD system to run rescoring on stages affected by particular code changes and automatically provide metrics, but for now, it is manual.
This example configuration tests what happens when we swap out a single model.
# Baseline configuration runs model scoring and decisioning.
baseline_config = RescoreConfig(
    [
        MODEL_SCORING,        # Evaluates ML models.
        DETECTION_DECISIONS,  # Evaluates our detection decisions using the model scores.
    ]
)

# Experimental setup to swap a single model.
experiment_config = RescoreConfig(
    [
        MODEL_SCORING,
        DETECTION_DECISIONS,
    ],
    FinalDetectorRescoreConfig(
        replace_models=[ReplacementModelConfig(
            model_path="/path/to/experimental/model",
            model_id=ATTACK_MODEL,
        )]
    ),
)

# Runs the rescoring and delivers analytics to the user.
run_rescoring(
    rescore_config=experiment_config,
    baseline_config=baseline_config,
)
We can use a similar system to generate model training data with experimental features.
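As a rough sketch of what that might look like, reusing the shape of the configuration above: the FEATURE_EXTRACTION stage, FeatureExtractionRescoreConfig, extra_feature_extractors, and generate_training_data below are hypothetical names for illustration, not the actual API.

# Hypothetical sketch: re-run feature extraction with an experimental feature
# and write out rows suitable for training a new model. FEATURE_EXTRACTION,
# FeatureExtractionRescoreConfig, extra_feature_extractors, and
# generate_training_data are illustrative names, not the real API.
training_data_config = RescoreConfig(
    [
        FEATURE_EXTRACTION,   # Re-extract features, including the experimental ones.
        MODEL_SCORING,
        DETECTION_DECISIONS,
    ],
    FeatureExtractionRescoreConfig(
        extra_feature_extractors=[experimental_sender_reputation_extractor],
    ),
)

# Writes out (features, label) rows for offline model training.
generate_training_data(
    rescore_config=training_data_config,
    output_path="/path/to/experimental/training_data",
)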
Both automatic and ad-hoc rescoring require a lot of heavy lifting behind the scenes. We run everything on Spark, and there are a lot of tricky data engineering problems to solve to satisfy the requirements listed above, especially the time travel problem. We'll be releasing part 2 of this story soon, describing how we built this system!
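As a generic illustration of the time travel problem (not Abnormal's implementation), a point-in-time join in PySpark might look like the sketch below: each message is joined only against aggregate statistics computed before that message arrived, so offline features match what was available online. Table and column names are assumptions.

# Generic point-in-time join sketch. Table and column names are assumptions.
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("point_in_time_join").getOrCreate()

messages = spark.table("messages")                   # message_id, sender, received_at, ...
sender_stats = spark.table("sender_stats_history")   # sender, computed_at, messages_seen, ...

# Keep, for each message, only the most recent sender snapshot computed
# *before* the message arrived, to avoid leaking future information.
joined = (
    messages.join(sender_stats, on="sender", how="left")
    .where(F.col("computed_at") <= F.col("received_at"))
)
latest_snapshot = Window.partitionBy("message_id").orderBy(F.col("computed_at").desc())
point_in_time_features = (
    joined.withColumn("rank", F.row_number().over(latest_snapshot))
    .where(F.col("rank") == 1)
    .drop("rank")
)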
If you are interested in solving tough Applied ML engineering problems in the cybersecurity space, yes, we’re hiring!
Thanks to Justin Young, Carlos Gasperi, Kevin Lau, Dmitry Chechick, Micah Zirn, and everyone else on the detection team at Abnormal who contributed to this pipeline.