Calibrating Classifiers in Reality
Abnormal’s fundamental job is to detect malicious emails, like phishing and business email compromise attacks, as well as other malicious events, such as suspicious sign-ins that indicate an account has been hacked. To do so, we have a complex web of features, sub-models, and classification models that decide whether an event is malicious. Once we’ve built a model, we must turn it into a classifier by selecting a threshold. This sounds easy, but there are many tricky details.
Using Classifiers to Predict Email Attacks
We’ll focus on the email attack detection case for this discussion and simplify it down to the core classification problem. The general approach is straightforward: we start with a probabilistic model M(X) that predicts the probability of an attack given our features X.
M(X) = P(attack | X)
Other articles discuss the attack model itself (here and here), so we won’t cover it again here. Instead, we are interested in what happens after we’ve built a model.
Once we have a probabilistic model, we can easily build an attack detector by thresholding it to define a decision boundary:
predict attack if M(X) > threshold
Such a classifier would have nice properties. For example, with well-calibrated probabilities the precision tracks the threshold (the expected precision above a threshold is at least the threshold itself), and the recall is recoverable from a precision/recall curve. This property would allow us to easily trade between precision and recall for our business needs by sliding a threshold up and down. This is important because our clients are sensitive to both false positives and false negatives, so tuning our detectors precisely is crucial to maintaining a high-quality product.
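To make this concrete, here is a minimal sketch (the function and variable names are illustrative, not our production code) of thresholding a calibrated score and checking the precision property described above:

```python
import numpy as np

def predict_attack(probs: np.ndarray, threshold: float) -> np.ndarray:
    """Flag a message as an attack when its calibrated score exceeds the threshold."""
    return probs > threshold

def precision_above(probs: np.ndarray, labels: np.ndarray, threshold: float) -> float:
    """Empirical precision of the messages flagged at this threshold.
    With well-calibrated probabilities this should be at least the threshold."""
    flagged = predict_attack(probs, threshold)
    return float(labels[flagged].mean()) if flagged.any() else float("nan")
```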
But there’s a problem: in reality, we rarely find that trained models produce accurate probabilities. This is a problem of miscalibration. We will delve into the causes of this miscalibration in a bit, but it is usually due to a combination of the learning algorithm (e.g., neural nets do not necessarily produce calibrated probabilities) and, more importantly, skewed data and label distributions in the training dataset. But first, let’s focus on why miscalibration is a problem for an ML product.
As we said, the model’s output does not line up with the true probability:
M(X) ≠ P(attack | X)
If it is a good model, we can still assume its output is correlated with the true probability, but because it is not an actual probability, it throws off our thresholding strategy. Imagine we create a classifier as before:
predict attack if M(X) > threshold
This threshold no longer lines up with an expected precision, so we must tune thresholds empirically to meet desired performance characteristics. To appreciate why this is not ideal in practice, consider Abnormal’s use case. We carefully tune our detectors to prevent email attacks, and we often need to move a false positive rate from X% to Y%. To do so, we must go back to our data and solve for the right threshold. If we want to build simple control knobs, or even an automated control system, we do not want to require this manual translation from desired performance to a threshold.
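As a rough illustration of that manual translation (a sketch only; the helper and data names are hypothetical, not our tooling), solving for a threshold that hits a target false positive rate on held-out data looks something like this:

```python
import numpy as np

def threshold_for_target_fpr(scores: np.ndarray, labels: np.ndarray, target_fpr: float) -> float:
    """Find a threshold whose false positive rate on held-out data is roughly the target.
    This has to be redone from data for every new model (and potentially every client)."""
    negative_scores = scores[labels == 0]
    # Roughly target_fpr of negatives score above the (1 - target_fpr) quantile.
    return float(np.quantile(negative_scores, 1.0 - target_fpr))
```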
In practice this caused some big issues, primarily:
- For each new model, we needed to carefully tune thresholds. This made it very hard to compare model A to model B on an even playing field. AUC gives one evaluation, but it aggregates performance across the whole curve, including regions outside our operating range, and we need more precise evaluations at particular thresholds. This difficulty slowed down our experimentation and our pace of launching new models.
- We needed to control thresholds separately for each client and each model. As our client base grew, this became increasingly problematic. We knew we had to either bake the client particulars into the model directly or somehow set thresholds automatically from data.
The ideal solution is to calibrate our model by adding an extra layer to do this translation automatically:
Calibrator(M(X)) = P(attack | X)
A common approach is to fit a regression from the model scores to empirical probabilities on a calibration dataset (we’ll call this the CalibrationDataset). Given a good CalibrationDataset, a simple regression function, for example isotonic regression, can re-map model scores into probability space.
The idea of isotonic regression is to partition the range of the model’s predictions into N buckets, estimate the empirical fraction of positive examples within each bucket, and then interpolate between the buckets. There are many ways to improve on this by drawing a smoother function between the points instead of a piecewise-linear one, for example by interpolating with linear regression or splines. These are simple details; the difficult part is producing the CalibrationDataset on which this all depends.
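As a sketch of what this calibration layer can look like (assuming scikit-learn and an already-labeled CalibrationDataset; the function name is ours, not the exact production setup):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_calibrator(raw_scores: np.ndarray, labels: np.ndarray) -> IsotonicRegression:
    """Fit a monotonic mapping from raw model scores M(X) to an empirical attack probability."""
    calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    calibrator.fit(raw_scores, labels)
    return calibrator

# Usage: calibrated_scores = fit_calibrator(raw_scores, labels).predict(new_raw_scores)
```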
Sources of Error in Calibration Data
As we discussed above, to calibrate a classifier (or to train a model that predicts probabilities correctly in the first place), we need to produce a dataset that is distributionally equivalent to what the online production system will see. Let’s imagine we have some true probability distribution of attacks:
P_true(attack|features)
At Abnormal, we know our training data is quite different from this distribution. One reason is that we heavily subsample negative examples (safe emails) and additionally use many positive examples (attack emails) from different time ranges. This is because we want to include the history of all attacks in our models, but we cannot include the history of all safe messages due to the enormity of the data; we see 100,000x or more safe emails than attacks since the base rate is very low.
Additionally, we do not want to force ML engineers to attempt to produce a representative distribution every time they train a model. It may slow down iteration or cause other issues due to missing data, the need to experiment with filtering functions, etc. Instead, we prefer to train uncalibrated models and then fix them afterward with calibration.
That leaves us with the same problem: how do we produce a CalibrationDataset drawn from the true distribution?
First, let’s enumerate some types of distributional errors we commonly encounter:
- Sampling distributional errors. The most obvious errors are from how we sample the positive and negative examples. We may select only 10% of negative samples since they are so prevalent.
- Label distributional errors. We do not and cannot label every message in a dataset. This means any calibration dataset will be only partially labeled.
- Client distributional errors. At Abnormal, we have many different clients in different industries and with particular characteristics. While there are other potential complicating variables, the client is a particularly impactful one. We may learn a distribution on some clients that does not translate to new clients. Ideally, as our models learn across a larger sampling of clients, this issue will continue to lessen, but it must be taken into account.
- Data quality distributional errors. We may also have fundamental differences in feature values in our calibration dataset due to engineering data quality. For example, some features may not be possible to backstate (reconstruct historically) for certain samples.
- Time distributional shifts. Any calibration dataset will be earlier in time than the data on which we will apply the model. There are fundamental time-based shifts due to naturally changing email traffic. Additionally, given the adversarial nature of the problem, we expect the attack distribution to change as adversaries evolve their strategies.
Attempting to Correct for These Errors
It’s important to understand each of these possible sources of error and think through others if necessary. Once we understand the errors, we can build correction mechanisms. Ideally, we also produce small datasets that help us evaluate how much our correction methods succeed in this task. If we can correct each error source, we can ideally produce a good calibrator.
Here are some high-level ideas on how to correct for each error:
- Correcting for sampling errors. This can be done by understanding the exact mechanism used to sample in the first place and reversing it. As a simple example, if we uniformly sampled 10% of the negative class, the positive-class odds in that dataset are inflated by a factor of 10, and we can shift them back (a sketch of this correction appears after this list).
- Correcting labeling error. For various reasons we cannot and do not label an entire dataset, but we can control exactly what we do and do not label within it. We can use the label selection criterion (i.e., how we choose which data to label in the first place) together with sample weighting (i.e., weighting labeled samples relative to unlabeled ones) to help correct errors caused by unlabeled samples in a dataset.
- Correcting per-client error. We can manually learn marginal distributions for each client and shift our calibrator to compensate. Or, if we want to get more sophisticated, we can attempt to model clients through some featurization (for example, the client’s industry or size) and learn marginal distributions across those features.
- Correcting data quality error. This error is easy to measure: we can compare distributions of model scores and features between our online system and historical batch data, then use that measurement to shift our distribution as needed.
- Correcting time distributional shifts. This is the hardest to correct for. One possibility is to attempt to model the shift with a time series model. Another method is to monitor and correct for distributional shifts with an online system measuring drift over time. Going into details here would be a blog post in its own right.
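To make the first of these corrections concrete, here is a sketch of the standard odds-based adjustment for uniformly downsampled negatives (this assumes uniform sampling at a known keep rate, not necessarily the exact mechanism we use in production):

```python
def correct_for_negative_downsampling(p_subsampled: float, negative_keep_rate: float) -> float:
    """Map a probability estimated on data where negatives were uniformly downsampled
    (e.g. keep rate 0.1 for "10% of the negative class") back to the full distribution.

    Downsampling negatives inflates the positive-class odds by 1 / keep_rate,
    so we deflate the odds by the same factor to undo it. Assumes p_subsampled < 1.
    """
    odds_subsampled = p_subsampled / (1.0 - p_subsampled)
    odds_true = odds_subsampled * negative_keep_rate
    return odds_true / (1.0 + odds_true)

# Example: a score of 0.5 learned on 10%-of-negatives data corresponds to
# roughly 0.09 under the full distribution.
```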
Engineering an Imperfect Solution
Unfortunately, even after significant work on this problem attempting to correct all these errors, we failed to produce a perfect all-around calibrator. Too many distributional errors persisted.
This sort of setback is common in ML engineering. Rather than give up, we instead thought creatively. We asked the question: Do we actually need a calibrated classifier?
Well, yes and no. A calibrated classifier is sufficient, but not necessary. That is, could we loosen the requirements? To answer this question, we listed out the actual desiderata for our classifier:
- Calibration matters only in a specific operating range. In reality, the model only needs to make good predictions at very high precision, and we care much less about calibration lower down in the PR curve because we are only remediating attacks for which we are quite confident, as almost all emails are safe. This is the key insight.
- Score stability property. The range of scores emitted by a classifier should be relatively stable across clients and across versions of the trained model. For example, we would like a score of 0.95 to mean roughly the same thing when we roll out a new model or onboard a new client. The score should also be smooth: moving a threshold by some amount should smoothly affect the volume of flagged messages and the precision. If we have a stable score, we can more easily build systems on top of the model, such as a control system.
- Ranking property. Perhaps obvious, but the calibrator needs to generally rank messages from least likely to most likely to be attacks. That is, it should have a high AUC.
Simplifying the problem made it more tractable. We ended up building a calibration system that has the following properties:
- Calibration is correct at about 0.95 precision.
- Scores are not well calibrated below this point, but they are relatively stable between clients and across new versions of the model.
- Very low scores are uncalibrated, and we do not put much trust in the model at low-confidence predictions. Luckily, we rarely worry about performance low in the curve because we use this classifier only to predict the positive attack class.
Below is an illustration of the stabilized predictor and how it might match the ideal calibrated predictor in some places and not others.
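One way to sanity-check these properties (a sketch with illustrative names, assuming held-out labeled samples per client and per model version) is to measure empirical precision in the operating range:

```python
import numpy as np

def precision_at_cutoff(scores: np.ndarray, labels: np.ndarray, cutoff: float = 0.95) -> float:
    """Empirical precision of predictions scored at or above the cutoff.
    If calibration holds in the operating range, this should sit close to the cutoff."""
    flagged = scores >= cutoff
    return float(labels[flagged].mean()) if flagged.any() else float("nan")

# Stability check: compute this per client and per model version and verify the
# values stay tightly clustered around 0.95 rather than drifting.
```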
Conclusion
Once we developed this calibration method, it made many tasks easier. Before calibration, we had to manage thresholds very carefully across clients and between models. Now there is a single threshold to control for each model, and this threshold is stable from one model to the next and from one client to another. For example, a score of 0.95 means the same thing across models.
This has increased development speed and made it much easier to run experiments. Trusting this calibration method has also removed many moving parts an ML engineer must think about when comparing one model to another.
Key takeaways include:
- Start with a theoretical framework for a problem, but don’t be afraid to cut corners and simplify this framework to make progress. For example, the key insight that our model only needs to be calibrated at the top scores helped dramatically simplify the problem.
- Reducing degrees of freedom helps with productivity. In our case, manually controlling thresholds per client or for new models could eke out slightly better performance, but sticking to a fixed calibration method and a single threshold allows easier progress. This is because the ML engineer does not need to think about the calibration problem for every model on top of the core feature and model improvement tasks.
- There are many steps beyond the model itself to build a good product on top of ML. Do not focus solely on getting the best AUC; also think about managing thresholds, running experiments, iteration, and so on.
If these problems interest you, check out our open opportunities because we're hiring!