From Data to Detection: Crafting AI for Email Threat Detection

Discover how Abnormal Security leverages AI and decision trees to extract signals, analyze context, and detect sophisticated email threats with high accuracy.
August 23, 2024

Imagine one morning, this email appears in your inbox:

[Image: Data to Detection Blog Email Mockup]

Your reaction might be, “Company swag, exciting! Wait, is this real?”

First, check the sender’s email address. A legitimate email from Tractor Supply Co. would come from a company-specific domain, not a generic one like Outlook. Even though the email may look professional, phishing emails often mimic legitimate ones. Also watch for red flags like urgency or too-good-to-be-true offers. Hover over the link to inspect the URL; if it seems suspicious, it’s likely phishing.

Our detection team aims to replicate this human intuition at scale. By extracting digital signals from each email and applying natural language processing (NLP) to its content, we can detect anomalies and flag suspicious content with high accuracy.

The Crucial Role of Signal Extraction and Context Analysis

The first step in email threat detection is to dissect an email into its components and assess each one for legitimacy, an approach that serves both human analysts and machine learning models. Our detection engine extracts thousands of signals from each email, such as the sender’s domain, embedded links, and financial-related keywords. However, these surface signals alone aren’t enough. For example, an email from an outlook[.]com address may not raise immediate red flags, but if it’s attempting to trick the recipient into clicking a link, the danger becomes apparent.
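As a rough illustration of surface-level signal extraction, the sketch below pulls the three kinds of signals named above (sender domain, embedded links, financial keywords) out of a single message. It is a minimal sketch, not Abnormal's implementation: the `extract_signals` function, the domain and keyword lists, and the sample email address and URL are all hypothetical, and naive substring matching stands in for real text analysis.

```python
import re
from email.utils import parseaddr

# Hypothetical lists for illustration; a production system would use far richer data.
FREE_MAIL_DOMAINS = {"outlook.com", "gmail.com", "yahoo.com"}
FINANCIAL_KEYWORDS = {"invoice", "payment", "wire", "refund", "gift card"}
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_signals(from_header: str, subject: str, body: str) -> dict:
    """Return a handful of surface-level signals for one email."""
    _, address = parseaddr(from_header)
    domain = address.rpartition("@")[2].lower()
    text = f"{subject} {body}".lower()
    links = URL_PATTERN.findall(body)
    return {
        "sender_domain": domain,
        "sender_from_email_hosting_domain": domain in FREE_MAIL_DOMAINS,
        "link_count": len(links),
        "links": links,
        # Naive substring match; good enough for a sketch only.
        "financial_keywords": sorted(k for k in FINANCIAL_KEYWORDS if k in text),
    }


# Made-up message loosely modeled on the swag email described earlier.
signals = extract_signals(
    '"Tractor Supply Co." <rewards@outlook.com>',
    "Claim your free company swag",
    "Click here: http://example.test/claim to confirm your payment details.",
)
print(signals)
```

On their own, none of these values is conclusive, which is why the signals are combined with further context rather than used individually.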

To enhance detection, we go beyond surface-level signals and explore secondary signals, such as interaction frequency between the sender and recipient, the originating IP address, and any blocklist history. Machines excel at extracting and analyzing these nuanced details, often making quicker and more accurate judgments than humans.

Additionally, advanced signals require rich contextual analysis. Consider a U.S.-based employee receiving a financial request from Nigeria—a scenario that might raise concerns due to the location's association with attacks. Yet, legitimate traffic can also originate from Nigeria. By analyzing the recipient's history, such as the frequency of financial communications from Nigeria and the writing style, we can form a more comprehensive assessment of the email’s legitimacy.
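One way to picture this kind of contextual signal is as a simple rate computed over the recipient's own history. The sketch below is an assumption-laden illustration: the `PastEmail` record, the `financial_origin_rate` function, and the sample histories are hypothetical, and a real system would weigh many more factors, such as writing style.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PastEmail:
    origin_country: str  # e.g. a country code derived from the originating IP
    is_financial: bool


def financial_origin_rate(history: Iterable[PastEmail], country: str) -> float:
    """Fraction of the recipient's past financial emails that came from `country`."""
    financial = [e for e in history if e.is_financial]
    if not financial:
        return 0.0
    by_country = Counter(e.origin_country for e in financial)
    return by_country[country] / len(financial)


# Recipient A regularly handles financial email from Nigeria; recipient B never has.
regular_partner = [PastEmail("NG", True)] * 2 + [PastEmail("US", True)] * 8
no_prior_contact = [PastEmail("US", True)] * 10 + [PastEmail("NG", False)]

rate_a = financial_origin_rate(regular_partner, "NG")
rate_b = financial_origin_rate(no_prior_contact, "NG")
print(rate_a, rate_b)
```

The same incoming request would look far more ordinary for recipient A than for recipient B, which is the kind of distinction contextual analysis is meant to capture.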

Behind the scenes, we train models to establish the context and evaluate the likelihood of different scenarios, determining whether an email is a legitimate marketing message or a phishing attempt. By combining contextual understanding with signal extraction, we ensure that each email is handled with the appropriate level of scrutiny. For instance, one email flagged by our system came from a sender the recipient rarely interacted with, had no graymail history, and contained only an image urging a click: an immediate red flag.

Unleashing the Power of the Abnormal Rule Engine with Smarter Rule Crafting

Our detection engine operates on multiple layers to protect users—leveraging machine learning models, attack signatures, and both primary and secondary email features, as discussed earlier. This system analyzes each incoming email by extracting signals and generating predictive scores to identify indicators of compromise, effectively filtering out malicious content from user inboxes.

When specific attack patterns are identified, an email security analyst can easily translate these insights into expressive rules that are then deployed to the engine to defend against similar threats in the future. For example, a rule for flagging the scenario mentioned above might look like this:

(
sender_from_email_hosting_domain = true AND
body_has_anchor_image = true AND
body_text_length = 0 AND
subject_has_engagement_vocab = true
)
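To show how a rule like this could be evaluated against extracted signals, here is a minimal sketch that encodes the rule as nested tuples and walks it recursively. The tuple encoding, the `evaluate` function, and the `EQ` operator name are assumptions made for illustration; the post does not describe how the Abnormal rule engine represents or executes rules internally.

```python
from typing import Any, Dict, Tuple

Rule = Tuple[Any, ...]


def evaluate(rule: Rule, signals: Dict[str, Any]) -> bool:
    """Recursively evaluate a nested AND/OR rule against a dict of signals."""
    op = rule[0]
    if op == "AND":
        return all(evaluate(child, signals) for child in rule[1:])
    if op == "OR":
        return any(evaluate(child, signals) for child in rule[1:])
    if op == "EQ":
        _, name, expected = rule
        return signals.get(name) == expected
    raise ValueError(f"unknown operator: {op!r}")


# The rule from the text, in the hypothetical tuple encoding.
swag_rule = (
    "AND",
    ("EQ", "sender_from_email_hosting_domain", True),
    ("EQ", "body_has_anchor_image", True),
    ("EQ", "body_text_length", 0),
    ("EQ", "subject_has_engagement_vocab", True),
)

image_only_email = {
    "sender_from_email_hosting_domain": True,
    "body_has_anchor_image": True,
    "body_text_length": 0,
    "subject_has_engagement_vocab": True,
}

print(evaluate(swag_rule, image_only_email))  # True
```

Because `AND` and `OR` nodes can contain other `AND` and `OR` nodes, the same evaluator handles arbitrarily nested rules.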

In short, a rule is a hierarchical boolean expression constructed from ANDs and ORs over a set of criteria. As our customer base grows rapidly, manual rule-writing has become a bottleneck, highlighting the need for a more scalable way to maintain and improve detection. So how can modeling techniques help solve this problem?

The nature of rules inevitably brings to mind decision trees, a model that uses a flowchart-like structure to make decisions based on input data. A decision tree splits data into branches based on feature values, leading to decision nodes that guide the final prediction. Each path from the root to a leaf node represents a decision rule, making the model easy to interpret and visualize.

[Image: Data to Detection Blog Flowchart]

An illustrative flowchart of a decision tree model, showing decision nodes, branches, and paths leading to final predictions.

The formation of detection rules resembles a tree-like structure, where each signal serves as a test to filter in or out a particular message, and each leaf node represents the machine-determined classification of that message. This structured modeling approach, in principle, mirrors the thinking process of rule creation and addresses the scalability challenge. With the decision tree model, we efficiently and effectively ingest the vast number of signals and isolate suspicious patterns from billions of emails on a daily basis—all while staying ahead of increasingly sophisticated attacks.
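The correspondence between tree paths and rules can be sketched in a few lines. The example below uses a small hand-written tree, not a trained model, and collects every root-to-leaf path that ends in an "attack" leaf as an AND-conjunction of signal tests. The `Node` and `Leaf` classes, the `attack_rules` function, and the tree shape are hypothetical and chosen only to reuse the signal names from the rule shown earlier.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    label: str  # "attack" or "safe"


@dataclass
class Node:
    signal: str  # boolean signal tested at this node
    if_true: "Tree"
    if_false: "Tree"


Tree = Union[Leaf, Node]


def attack_rules(tree: Tree, path: Tuple[str, ...] = ()) -> List[str]:
    """Return one AND-rule per root-to-leaf path that ends in an attack leaf."""
    if isinstance(tree, Leaf):
        return [" AND ".join(path)] if tree.label == "attack" else []
    return (
        attack_rules(tree.if_true, path + (f"{tree.signal} = true",))
        + attack_rules(tree.if_false, path + (f"{tree.signal} = false",))
    )


# Hand-built illustrative tree; a learned tree would choose its own splits.
tree = Node(
    "sender_from_email_hosting_domain",
    if_true=Node(
        "body_has_anchor_image",
        if_true=Leaf("attack"),
        if_false=Node(
            "subject_has_engagement_vocab",
            if_true=Leaf("attack"),
            if_false=Leaf("safe"),
        ),
    ),
    if_false=Leaf("safe"),
)

rules = attack_rules(tree)
for rule in rules:
    print(f"({rule})")
```

Joining the extracted paths with OR yields a single hierarchical boolean expression of the same shape an analyst would write by hand.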

During the proposal phase, we explored a variety of decision tree models, ranging from basic trees to those incorporating some level of randomization. Even as an experiment, prototyping the idea with 100% of traffic fed into the model wasn’t feasible because of concerns around computational cost and model performance. Instead, we validated the idea through a series of deliberate choices during development: careful data sampling, parameter and optimization function settings, and a rigorous evaluation process.

The results were promising. The introduction of randomization, as seen in models like random forest, added robustness to rule creation, enabling us to better generalize from known attack patterns and handle edge cases without causing false positives. Additionally, weighted sampling techniques allowed us to differentiate between attacks and legitimate emails while also reducing the computational load once the model is deployed to production. This balanced approach ensured that our system remained both efficient and highly accurate in real-world scenarios.
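The post does not spell out the sampling scheme, so the following is only one common interpretation of weighted sampling for heavily imbalanced data: keep every attack, keep a random fraction of legitimate emails, and give each kept legitimate email a weight that compensates for the ones dropped. The `downsample_with_weights` function, the 10% keep rate, and the toy label counts are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def downsample_with_weights(
    labels: Sequence[bool], keep_legit_rate: float, seed: int = 0
) -> List[Tuple[int, float]]:
    """Return (index, weight) pairs: every attack kept, legitimate emails subsampled."""
    if not 0 < keep_legit_rate <= 1:
        raise ValueError("keep_legit_rate must be in (0, 1]")
    rng = random.Random(seed)
    sample = []
    for i, is_attack in enumerate(labels):
        if is_attack:
            sample.append((i, 1.0))
        elif rng.random() < keep_legit_rate:
            # Each kept legitimate email stands in for 1 / rate originals.
            sample.append((i, 1.0 / keep_legit_rate))
    return sample


# Toy data: 5 attacks among 1,000 legitimate emails.
labels = [True] * 5 + [False] * 1000
sample = downsample_with_weights(labels, keep_legit_rate=0.1)
print(len(sample), "of", len(labels), "emails kept for training")
```

The resulting weights could then be passed to a tree learner that accepts per-sample weights, so the model trains on far fewer rows while still seeing every known attack.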

These experiments laid the groundwork for our current approach, where the combination of algorithm-powered rule creation and human intelligence enables us to tackle the complexity and scale of modern email threats.

Our Hybrid Approach to Better Protect Our Customers

Abnormal’s advanced email threat detection systems leverage a fusion of heuristic techniques and machine learning algorithms to accurately detect and neutralize malicious content. By implementing decision tree models, we've improved our system's ability to generalize from diverse threat signatures, ensuring high detection accuracy and computational efficiency.

This is just one example of our hybrid approach, which combines AI-driven automation with expert domain knowledge to enhance our security infrastructure and streamline operational processes. We’re proud of our robust, scalable framework that fortifies our defenses and offers enhanced protection for our customers against sophisticated email threats.


As a fast-growing company, we have lots of interesting engineering challenges to solve, just like this one. If these challenges interest you, and you want to further your growth as an engineer, we’re hiring! Learn more at our careers website.
