At the core of all Abnormal’s detection products sits a sophisticated web of prediction models. For any of these models to function we need deep and thoughtfully engineered features, careful modeling of sub-problems, and the ability to join data from a set of databases.
For example, one type of email attack we detect is called Business Email Compromise (BEC). A common BEC attack is a “VIP impersonation” in which the attacker pretends to be the CEO or other VIP in a company in order to convince an employee to take some action. Some of the inputs to a model detecting this sort of attack include:
All these attributes are carefully engineered and may rely on one another in a directed graphical fashion.
This article describes Abnormal’s graph-of-attributes system which makes this type of interconnected modeling scalable. This system has enabled us to grow our ML team while continuing to rapidly innovate.
We store all original entity data as rich thrift objects (for example a thrift object representing an email or an account sign-in). This allows flexibility in terms of the data types we log, enables easy, backward compatibility, and understandable data structures. But as soon as we want to convert this data into something that will be consumed by data science engines and models, we should convert these into attributes. An attribute is a simply-typed object (float / int / string / boolean) with a numeric attribute ID.
Attribute vs Features — Attributes are conceptually similar to features, but they might not be quite ready to feed into an ML model. These should be ready to convert into a form consumable by models. All the heavy lifting should occur at the time of attribute extraction, for example running inference on a model or hydrating from a database.
The core principles we are working off include:
Consuming Attributes — Once data is converted into a columnar format it can be consumed in many ways: Ingested into a columnar store for analytics, tracked in metrics to monitor distributional shifts, and converted directly into a feature dataframe ready for training with minimal extra logic.
Computing attributes as a directed graph allows enormous flexibility for parallel development by multiple engineers. If each attribute declares its inputs, we can ensure everything is queried and calculated in the correct order. This enables attributes of multiple types:
Attribute Hydration Graph:
Explicitly encoding the graph of attributes seems complex but it will save you painful headaches down the road when you want to use one attribute as an input to another.
Inevitably we will want to iterate on attributes and the worst feeling is realizing that the attribute you want to modify is used by ten downstream models. How do you make a change without retraining all those models? How do you verify and stage the change?
This situation comes up frequently. Some common cases:
If each attribute is versioned and downstream consumers register which version they wish to consume, then we can easily bump the version (while continuing to compute the previous versions) without affecting the consumers.
In addition to enabling flexible modeling of complex problems, this graph of models enables us to scale our ML engineering team. Previously we had a rigid pipeline of features and models which was really only amenable to a single ML engineer at a time to develop. Now, we can have multiple ML engineers developing models for sub-problems, and then combining the resulting features and models together later.
We need to figure out how to more efficiently re-extract this graph of attributes for historical data and good processes for sunsetting older attributes. We would like to build a system that allows our security analysts and anyone else in the company to easily contribute attributes and allow those to automatically flow into downstream models and analysis. We need to improve our ability to surface relevant attributes and models scores important to a given decision back to the client to understand the reasons an event is flagged. And so much more… If these problems interest you, yes, we’re hiring!
Abnormal is the email security company that stands for trust.
© 2021 Abnormal Security Corporation.
All rights reserved.