Model Understanding with Feature Importance
Here at Abnormal, our machine learning models help us spot trends and abnormalities in customer data so that we can catch and prevent cyberattacks. Maintaining and improving these models requires a deep understanding of how they operate and make decisions. One way to build this understanding is to analyze how each model uses the features we feed it.
An Overview of Feature Permutation
A simple algorithm we can use to accomplish this is called feature permutation. The algorithm proceeds as follows:
1. Given a model, a dataset, and a feature F, compute the baseline model performance over the dataset.
2. For each sample in the dataset, randomly choose another sample from the dataset and swap the values of F between these samples.
3. The importance of feature F is the difference between the model performance over the permuted dataset and the baseline model performance.
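The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production implementation: it assumes a scikit-learn-style classifier, uses ROC-AUC as the performance metric, and shuffles the whole feature column at once, which is roughly equivalent to many random pairwise swaps.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def feature_importance(model, X, y, feature_idx, rng):
    # Step 1: baseline performance over the unpermuted dataset.
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    # Step 2: permute the feature's values across samples (shuffling
    # the column stands in for many random pairwise swaps).
    X_perm = X.copy()
    X_perm[:, feature_idx] = rng.permutation(X_perm[:, feature_idx])
    # Step 3: importance is the performance drop after permutation.
    return baseline - roc_auc_score(y, model.predict_proba(X_perm)[:, 1])

X, y = make_classification(n_samples=2000, n_features=5, n_informative=3, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)
rng = np.random.default_rng(0)
importances = [feature_importance(model, X, y, i, rng) for i in range(X.shape[1])]
```

A large positive importance means performance degrades badly when the feature's values are scrambled; an importance near zero means the model barely relies on that feature.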
Intuitively, feature permutation tells us how much a particular feature's value contributes to the model's final decision. Let's illustrate this with an example.
Feature Permutation Case Study: Ensemble Models
Our email attack detection systems at Abnormal Security have multiple layers. Upstream machine learning models generate predictions based on a variety of message attributes, and an ensemble model makes the final decision. This ensemble is trained on the predictions of the individual upstream models, and we retrain it whenever we add or change one of these models. These updates can dramatically change the strategy the ensemble uses to make decisions.
For example, suppose we have an ensemble that is trained on three upstream models A, B, and C, where models B and C are substantially more important.
Now suppose that we add new features to model A and retrain the ensemble. There are multiple potential impacts of this change. One possibility is that all three models become equally important:
Another possibility is that model A increases in importance at the expense of model B:
This is a likely outcome if the features we added to model A were also used in model B.
An Overview of Feature Group Permutation
One pitfall of feature permutation is that it doesn't play nicely with correlated features. If we have N features derived from different versions of the same model, or M counts of very similar quantities, the sum of the individual feature importances can underestimate the importance of the group as a whole. Luckily, we can easily get around this with a similar algorithm called feature group permutation, which computes the importance of a group of features rather than a single feature. It proceeds as follows:
1. Given a model, a dataset, and a group of features G, compute the baseline model performance over the dataset.
2. For each sample in the dataset, randomly choose another sample from the dataset and swap the values of each feature F in G between these samples.
3. The importance of group G is the difference between the model performance over the permuted dataset and the baseline model performance.
By permuting correlated features together as a group, we avoid underestimating their combined importance.
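As a sketch, group permutation differs from the single-feature version in exactly one place: the same row permutation is applied to every feature in the group. The toy example below (again, not production code) builds two near-duplicate copies of one signal so that the effect of grouping is visible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
signal = rng.normal(size=5000)
noise = rng.normal(size=5000)
# Columns 0 and 1 are near-duplicates of the same underlying signal.
X = np.column_stack([signal, signal + 0.01 * rng.normal(size=5000), noise])
y = (signal + 0.5 * noise > 0).astype(int)
model = LogisticRegression().fit(X, y)

def group_importance(model, X, y, group, rng):
    baseline = roc_auc_score(y, model.predict_proba(X)[:, 1])
    X_perm = X.copy()
    # The only change from single-feature permutation: one shared row
    # permutation is applied to every feature in the group, so the
    # group's values are swapped between samples together.
    rows = rng.permutation(len(X))
    X_perm[:, group] = X[rows][:, group]
    return baseline - roc_auc_score(y, model.predict_proba(X_perm)[:, 1])

# Permuting the duplicates one at a time can understate their combined
# importance, because the intact copy still carries the signal.
solo = [group_importance(model, X, y, [i], rng) for i in (0, 1)]
joint = group_importance(model, X, y, [0, 1], rng)
```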
Feature Group Permutation Case Study: Suspicious Attachments
One of the models we use at Abnormal Security is our attack multi model, which consumes a wide range of feature types and predicts the likelihood that a particular email is an attack. One question we had recently about this model was how it determines that a particular message contains a malicious attachment. The model has access to a ton of different sources of information, so here are a few hypotheses about how it could make this decision:
- It primarily focuses on the sender and recipient information.
- It primarily focuses on the subject and header text of the email.
- It primarily focuses on the body text of the message.
- It primarily focuses on the signals in the attachment itself.
We can use feature group permutation to answer this. First, we group our features based on the type of data each is constructed from. Then, we pull all messages with attachments and compute the feature group importances over this dataset.
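As a concrete, entirely hypothetical sketch of this setup: the feature names and grouping below are illustrative placeholders, not Abnormal's actual features.

```python
# Hypothetical feature groups keyed by the data source each feature is
# built from. These names are made up for illustration.
feature_groups = {
    "sender_recipient": ["sender_domain_age_days", "recipient_count"],
    "subject_header": ["subject_has_urgent_language", "reply_to_mismatch"],
    "body_text": ["body_link_count", "body_mentions_payment"],
    "attachment": ["attachment_has_macro", "attachment_entropy"],
}

def messages_with_attachments(messages):
    # Restrict the evaluation dataset to the question at hand: only
    # messages that actually carry an attachment.
    return [m for m in messages if m.get("has_attachment")]
```

Each group of columns would then be permuted together, as in the algorithm above, to produce one importance score per data source.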
We see that the attachment features are only somewhat important, since the model is able to catch attacks from the other signals on the message.
However, this is not the complete story. There are certain types of messages for which the attachment is more central to the message due to the language used. On these messages, we would expect that the model would need to pay closer attention to the attachment in order to detect attacks. If we limit our dataset to the subset of messages that contain these attachment types and recompute the feature group importance, we see the importance of the attachment features increase:
Using Feature Group Permutation Effectively
Despite its simplicity, feature group permutation is an extremely powerful tool. Here are a few tips for using it effectively:
- Choosing Feature Groups: If the information represented by the features in some group G is also represented by features outside of G, the feature group permutation algorithm may underestimate the importance of G. For this reason, it is usually helpful to start with very large groups and only break down the groups with high importance.
- Choosing the Dataset: Not all features are useful on all samples. Certain signals may be extremely predictive but only rarely present. For this reason, it is usually helpful to choose the dataset based on the questions we have about the model's behavior.
- Choosing the Metric: The feature group permutation algorithm computes the importance of a group in terms of some metric of model performance, and different metrics capture different kinds of importance. For example, suppose we are studying a binary classification model. A feature that determines the overall calibration of the model will be assigned high importance by a cross-entropy loss metric but low importance by ROC-AUC.
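A toy illustration of this last point: applying a strictly monotone rescaling to a model's probabilities, which is roughly what a calibration-only feature controls, changes the cross-entropy loss while leaving ROC-AUC untouched.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score

rng = np.random.default_rng(0)
# Well-calibrated probabilities, with labels actually drawn from them.
p = rng.uniform(0.05, 0.95, size=5000)
y = (rng.uniform(size=5000) < p).astype(int)

# A strictly monotone "sharpening" of the probabilities: the ranking of
# samples is unchanged, but the scores become overconfident.
p_sharp = p**3 / (p**3 + (1 - p) ** 3)

auc_before, auc_after = roc_auc_score(y, p), roc_auc_score(y, p_sharp)
ce_before, ce_after = log_loss(y, p), log_loss(y, p_sharp)
# ROC-AUC is identical (it is rank-based), while cross-entropy gets worse,
# so a feature that only affects calibration registers as important under
# one metric but invisible under the other.
```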
Feature Importances at Abnormal
At Abnormal Security, we use feature importance analysis to understand our detection models. This helps us validate that new signals are useful and anticipate changes in model behavior. Understanding the feature importance distribution also enables us to anticipate which kinds of attacks might slip through our models so that we can prioritize feature development to improve our system.
Want to join our team to work on these problems? Abnormal is hiring! Check out our open roles to learn more.