Graph of Models and Features

February 2, 2021

At the core of all Abnormal’s detection products sits a sophisticated web of prediction models. For any of these models to function, we need deep and thoughtfully engineered features, careful modeling of sub-problems, and the ability to join data from a set of databases.

For example, one type of email attack we detect is called business email compromise (BEC). A common BEC attack is a “VIP impersonation” in which the attacker pretends to be the CEO or other VIP in a company in order to convince an employee to take some action. Some of the inputs to a model detecting this sort of attack include:

  1. Model evaluating how similar the sender’s name appears to match a VIP (indicating impersonation)
  2. NLP models applied to the text of the message
  3. Known communication patterns for the individuals involved
  4. Identity of the individuals involved extracted from an employee database
  5. … and many more

All these attributes are carefully engineered and may rely on one another in a directed graphical fashion.

This article describes Abnormal’s graph-of-attributes system which makes this type of interconnected modeling scalable. This system has enabled us to grow our ML team while continuing to rapidly innovate.

Attributes

We store all original entity data as rich thrift objects (for example a thrift object representing an email or an account sign-in). This allows flexibility in terms of the data types we log, enables easy, backward compatibility, and understandable data structures. But as soon as we want to convert this data into something that will be consumed by data science engines and models, we should convert these into attributes. An attribute is a simply-typed object (float / int / string / boolean) with a numeric attribute ID.

Image for post

Attribute vs Features: Attributes are conceptually similar to features, but they might not be quite ready to feed into an ML model. These should be ready to convert into a form consumable by models. All the heavy lifting should occur at the time of attribute extraction, for example running inference on a model or hydrating from a database.

The core principles we are working off include:

  • Attributes can rely on multiple modes of inputs (Other raw attributes, Outputs of models, Data hydrated from a database lookup or join)
  • Attributes should be flat data (i.e. primitives) and representable in a columnar database
  • Attributes should be simple to convert to features (for example you may need to convert a categorical attribute into a one-hot vector)
  • We will always need to change and improve attributes over time

Consuming Attributes: Once data is converted into a columnar format, it can be consumed in many ways—ingested into a columnar store for analytics, tracked in metrics to monitor distributional shifts, and converted directly into a feature dataframe ready for training with minimal extra logic.

Directed Graph of Attributes

Computing attributes as a directed graph allows enormous flexibility for parallel development by multiple engineers. If each attribute declares its inputs, we can ensure everything is queried and calculated in the correct order. This enables attributes of multiple types:

  1. Raw features
  2. Heuristics that use many other features as input
  3. Models that make a prediction from many other features
  4. Embeddings

Our Attribute Hydration Graph looks like this.

Image for post

Explicitly encoding the graph of attributes seems complex, but it saves us painful headaches down the road when we want to use one attribute as an input to another.

Attribute Versioning

Inevitably, we will want to iterate on attributes and the worst feeling is realizing that the attribute you want to modify is used by ten downstream models. How do you make a change without retraining all those models? How do you verify and stage the change?

This situation comes up frequently. Some common cases include:

  • An attribute is the output of a model or an embedding. You want to re-train the model, but this attribute is used by other models or heuristics.
  • An attribute relies on a database serving aggregate features and we would like to experiment with different aggregate bucketizations.
  • We have a carefully engineered heuristic feature and we would like to update the logic.

If each attribute is versioned and downstream consumers register which version they wish to consume, then we can easily bump the version (while continuing to compute the previous versions) without affecting the consumers.

Scaling a Machine Learning Team

In addition to enabling flexible modeling of complex problems, this graph of models enables us to scale our machine learning engineering team. Previously, we had a rigid pipeline of features and models which was really only allowed a single ML engineer at a time to develop. Now, we can have multiple ML engineers developing models for sub-problems, and then combining the resulting features and models together later.

We need to figure out how to more efficiently re-extract this graph of attributes for historical data and good processes for sunsetting older attributes. We would like to build a system that allows our security analysts and anyone else in the company to easily contribute attributes and allow those to automatically flow into downstream models and analysis. We need to improve our ability to surface relevant attributes and models scores important to a given decision back to the client to understand the reasons an event is flagged. And so much more… If these problems interest you, we’re hiring!

Related Posts

B 12 03 22 SIEM
Learn about Abnormal’s enhanced SIEM export schema, which provides centralized visibility into email threats
Read More
Blog phishing cover
The phishing email is one of the oldest and most successful types of cyberattacks. Attackers have long used phishing as a common attack vector to steal sensitive information or credentials from their victims. While most phishing emails are relatively simple to spot, the number of successful attacks has grown in recent years.
Read More
Blog brand cover
For those of you who have visited the Abnormal website over the last month, you’ve seen something different—a redesigned brand focused on precision. It’s new and innovative, and different from any other cybersecurity company, because it was created with one thing in mind: our customers.
Read More
B 11 22 21 AAA
At Abnormal, our customers have always been our biggest priority. Customer obsession is one of our five company values, and we live this every single day as we provide the best email security protection available for the hundreds of companies who entrust us to protect their mailboxes.
Read More
Blog microsoft abnormal cover
Before we jump into modern threats, I think it’s important to set the stage ​​since email has been around. Since email existed, threat actors targeted email users with malicious messages, general spam, and different ways to take advantage of the platform. Then of course, more dangerous attacks started to come up… things like malware and other viruses.
Read More
Blog black friday scam cover
While cybersecurity awareness is a year-round venture, it is especially important to be mindful during certain times of the year. With Thanksgiving here in the United States on Thursday, our thoughts will likely be on our family and friends and everything we have to be thankful for this holiday season.
Read More
Blog automation workflows cover
Our newest platform capabilities help customers streamline critical security workflows, like triaging phishing mailbox submissions or triggering tickets to investigate account takeovers, through automated playbooks. Doing so can decrease mean time to respond (MTTR) to incidents, further reducing any potential risk to the organization and eliminating manual workflows to save time and increase the efficiency of IT and security teams.
Read More
Blog tsa scam cover
On November 9, 2021, we identified an unusual phishing email that claimed to be from “Immigration Visa and Travel,” inviting the recipient to renew their membership in the TSA PreCheck program. The email wasn’t sent from a .gov domain, but the average consumer might not immediately reject it as a scam, particularly because it had the term “immigrationvisaforms” in the domain. The email instructed the user to renew their membership at another quasi-legitimate-looking website.
Read More
Blog pyspark cover
At Abnormal Security, we use a data science-based approach to keep our customers safe from the most advanced email attacks. This requires processing huge amounts of data to train machine learning models, build datasets, and otherwise model the typical behavior of the organizations we’re protecting.
Read More
Blog tiktok attack cover
As major social media platforms have expanded the ability of creators to monetize their content in the last few years, they and their users have increasingly found themselves the targets of malicious activity. TikTok is now no exception.
Read More
Blog ransomware guide cover
While various state agencies and the private sector keep track of ransomware attacks and related tactics worldwide, malicious actors change and evolve their ransomware strategies all the time. We’ve put together a comprehensive guide that will define ransomware, how to detect it, and what steps to take if you’ve fallen victim to a ransomware virus attack.
Read More
Blog detection efficacy cover
One of the key objectives of the Abnormal platform is to provide the highest precision detection to block all never-before-seen attacks. This ranges from socially-engineered attacks to account takeovers to everyday spam, and the platform does it without customers needing to create countless rules like with traditional secure email gateways.
Read More