From Data to Detection: Crafting AI for Email Threat Detection
Imagine one morning, this email appears in your inbox:
Your reaction might be, “Company swag, exciting! Wait, is this real?”
First, check the sender’s email address. A legitimate email from Tractor Supply Co. would come from a company-specific domain, not a generic one like Outlook. Even though the email may look professional, phishing emails often mimic legitimate ones. Also watch for red flags like urgency or too-good-to-be-true offers. Hover over the link to inspect the URL; if it seems suspicious, it’s likely phishing.
Our detection team aims to replicate this human intuition when assessing email threats. By applying natural language processing (NLP) to the digital signals in each email, we can detect anomalies and flag suspicious content with high accuracy.
The Crucial Role of Signal Extraction and Context Analysis
The first step of email threat detection is dissecting an email’s components to assess its legitimacy, an approach that has proven useful for both human analysis and machine learning. Our detection engine extracts thousands of signals from each email, such as the sender’s domain, embedded links, and financial-related keywords. However, these surface signals alone aren’t enough. For example, an email from an outlook[.]com address may not raise immediate red flags, but if it’s attempting to trick the recipient into clicking a link, the danger becomes apparent.
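As a rough illustration of surface-signal extraction, the sketch below pulls the three signals named above (sender domain, embedded links, financial keywords) out of a raw message using only the Python standard library. The domain and keyword lists, the function name `extract_signals`, and the output field names are assumptions made for this example; they are not the production engine, which extracts thousands of signals.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Hypothetical, deliberately tiny lists; a real system would use curated sets.
EMAIL_HOSTING_DOMAINS = {"outlook.com", "gmail.com", "yahoo.com"}
FINANCIAL_KEYWORDS = {"invoice", "payment", "wire", "gift card"}
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_signals(raw_email):
    """Return a dict of simple surface signals for one plain-text email."""
    msg = message_from_string(raw_email)
    _, address = parseaddr(msg.get("From", ""))
    domain = address.rpartition("@")[2].lower()
    payload = msg.get_payload()
    # Multipart messages return a list here; this sketch only handles plain text.
    body = payload if isinstance(payload, str) else ""
    lowered = body.lower()
    return {
        "sender_domain": domain,
        "sender_from_email_hosting_domain": domain in EMAIL_HOSTING_DOMAINS,
        "embedded_links": URL_PATTERN.findall(body),
        "financial_keywords": sorted(k for k in FINANCIAL_KEYWORDS if k in lowered),
    }
```

On its own, a result such as `sender_from_email_hosting_domain = True` says little, which is why these surface signals are combined with further context.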
To enhance detection, we go beyond surface-level signals and explore secondary signals, such as interaction frequency between the sender and recipient, the originating IP address, and any blocklist history. Machines excel at extracting and analyzing these nuanced details, often making quicker and more accurate judgments than humans.
Additionally, advanced signals require rich contextual analysis. Consider a U.S.-based employee receiving a financial request from Nigeria—a scenario that might raise concerns due to the location's association with attacks. Yet, legitimate traffic can also originate from Nigeria. By analyzing the recipient's history, such as the frequency of financial communications from Nigeria and the writing style, we can form a more comprehensive assessment of the email’s legitimacy.
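One way to picture this kind of contextual signal is a small function that measures how unusual a sender's country is among the financial emails a recipient has received before. This is only a sketch under assumed inputs: the history record fields (`country`, `is_financial`) and the function name `country_rarity` are hypothetical, and a real assessment would weigh many more factors, such as writing style.

```python
from collections import Counter


def country_rarity(history, country):
    """Return 1 minus the share of past financial emails sent from `country`.

    `history` is a list of dicts with hypothetical keys "country" and
    "is_financial". A result of 1.0 means the recipient has never received
    a financial email from that country; values near 0.0 mean it is routine.
    """
    financial = [h for h in history if h["is_financial"]]
    if not financial:
        return 1.0  # no financial history at all, so everything is unusual
    counts = Counter(h["country"] for h in financial)
    return 1.0 - counts[country] / len(financial)
```

A recipient who regularly handles invoices from Nigeria would get a low rarity value for that country, so the location alone would not push the email toward a phishing verdict.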
Behind the scenes, we train models to establish the context and evaluate the likelihood of different scenarios, determining whether an email is a legitimate marketing message or a phishing attempt. By combining contextual understanding with signal extraction, we ensure that each email is handled with the appropriate level of scrutiny. For instance, in one email flagged by our system, the sender was a rare contact with no graymail history, and the body contained only an image urging a click, which is an immediate red flag.
Unleashing the Power of the Abnormal Rule Engine with Smarter Rule Crafting
Our detection engine operates on multiple layers to protect users—leveraging machine learning models, attack signatures, and both primary and secondary email features, as discussed earlier. This system analyzes each incoming email by extracting signals and generating predictive scores to identify indicators of compromise, effectively filtering out malicious content from user inboxes.
When specific attack patterns are identified, an email security analyst can easily translate these insights into expressive rules that are then deployed to the engine to defend against similar threats in the future. For example, a rule for flagging the scenario mentioned above might look like this:
( sender_from_email_hosting_domain = true AND body_has_anchor_image = true AND body_text_length = 0 AND subject_has_engagement_vocab = true )
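To make the rule above concrete, here is a minimal sketch of how such an expression could be represented and evaluated against an email's extracted signals. The nested-tuple representation and the `evaluate` function are assumptions for illustration and do not describe the actual rule engine; only the four signal names come from the rule itself.

```python
def evaluate(rule, signals):
    """Recursively evaluate a nested AND/OR/EQ rule against a signal dict."""
    op = rule[0]
    if op == "AND":
        return all(evaluate(sub, signals) for sub in rule[1:])
    if op == "OR":
        return any(evaluate(sub, signals) for sub in rule[1:])
    if op == "EQ":
        _, name, expected = rule
        return signals.get(name) == expected
    raise ValueError(f"unknown operator: {op}")


# The example rule, written in the hypothetical tuple form.
IMAGE_ONLY_LURE = (
    "AND",
    ("EQ", "sender_from_email_hosting_domain", True),
    ("EQ", "body_has_anchor_image", True),
    ("EQ", "body_text_length", 0),
    ("EQ", "subject_has_engagement_vocab", True),
)
```

Because `AND` and `OR` nodes can contain further `AND` and `OR` nodes, the same evaluator handles rules of any nesting depth.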
In short, a rule is a hierarchical boolean expression constructed from ANDs and ORs over a set of criteria. As our customer base grows rapidly, manual rule-writing has become a bottleneck, highlighting the need for a more scalable way to maintain and improve detection. So how can modeling techniques help solve this problem?
The nature of these rules naturally brings to mind decision trees, models that use a flowchart-like structure to make decisions based on input data. A decision tree splits data into branches based on feature values, leading to decision nodes that guide the final prediction. Each path from the root to a leaf node represents a decision rule, making the model easy to interpret and visualize.
The formation of detection rules resembles a tree-like structure, where each signal serves as a test to filter in or out a particular message, and each leaf node represents the machine-determined classification of that message. This structured modeling approach, in principle, mirrors the thinking process of rule creation and addresses the scalability challenge. With the decision tree model, we efficiently and effectively ingest the vast number of signals and isolate suspicious patterns from billions of emails on a daily basis—all while staying ahead of increasingly sophisticated attacks.
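The correspondence between tree paths and rules can be sketched with a tiny, pure-Python decision tree over boolean signals. Everything here is an assumption for illustration: the training data is synthetic (an email is labeled an attack only when the first two signals are both true), the greedy Gini-based splitting is a textbook simplification, and none of it reflects the production models.

```python
from itertools import product

FEATURES = [
    "sender_from_email_hosting_domain",
    "body_has_anchor_image",
    "subject_has_engagement_vocab",
]
# Synthetic toy data: every combination of the three signals, labeled 1
# (attack) only when the first two signals are both true.
ROWS = [dict(zip(FEATURES, combo)) for combo in product([True, False], repeat=3)]
LABELS = [int(r[FEATURES[0]] and r[FEATURES[1]]) for r in ROWS]


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def build_tree(rows, labels, features, max_depth=3):
    """Greedily split on the feature with the lowest weighted impurity."""
    leaf = {"leaf": True, "attack": sum(labels) * 2 > len(labels)}
    if max_depth == 0 or len(set(labels)) <= 1:
        return leaf
    best = None  # (weighted impurity, feature name)
    for f in features:
        yes = [y for r, y in zip(rows, labels) if r[f]]
        no = [y for r, y in zip(rows, labels) if not r[f]]
        if not yes or not no:
            continue  # this feature does not separate the remaining rows
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if best is None or score < best[0]:
            best = (score, f)
    if best is None:
        return leaf
    f = best[1]
    rest = [g for g in features if g != f]
    branches = {}
    for value in (True, False):
        keep = [i for i, r in enumerate(rows) if bool(r[f]) == value]
        branches[value] = build_tree(
            [rows[i] for i in keep], [labels[i] for i in keep], rest, max_depth - 1
        )
    return {"leaf": False, "feature": f, "yes": branches[True], "no": branches[False]}


def extract_rules(tree, path=()):
    """Collect the root-to-leaf paths that end in an 'attack' leaf."""
    if tree["leaf"]:
        return [list(path)] if tree["attack"] else []
    f = tree["feature"]
    return (extract_rules(tree["yes"], path + ((f, True),))
            + extract_rules(tree["no"], path + ((f, False),)))


def format_rule(conditions):
    """Render one path in the same style as a hand-written rule."""
    return "( " + " AND ".join(f"{f} = {str(v).lower()}" for f, v in conditions) + " )"


rules = [format_rule(c) for c in extract_rules(build_tree(ROWS, LABELS, FEATURES))]
```

Each entry in `rules` is one path to an attack leaf, written as an AND of signal tests; several such paths together form the OR of a larger rule.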
During the proposal phase, we explored a variety of decision tree models, ranging from basic trees to those incorporating some level of randomization. Even for an experiment, prototyping the idea with 100% of traffic fed into the model wasn’t feasible because of concerns around computational cost and model performance. Instead, we validated the idea through a series of deliberate choices during development: careful data sampling, tuning of parameters and the optimization function, and a rigorous evaluation process.
The results were promising. Introducing randomization, as in models like random forest, added robustness to rule creation, enabling us to better generalize from known attack patterns and handle edge cases without introducing false positives. Additionally, weighted sampling techniques helped the model differentiate between attacks and legitimate emails while also reducing the computational load once the model is deployed to production. This balanced approach kept our system both efficient and highly accurate in real-world scenarios.
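The post does not spell out the sampling scheme, so the following is only one common interpretation of weighted sampling for heavily imbalanced traffic: keep every attack, keep a random fraction of legitimate emails, and give each kept legitimate email a weight that compensates for the ones dropped. The function name, the `is_attack` field, and the `legit_keep_rate` parameter are all hypothetical.

```python
import random


def downsample_with_weights(emails, legit_keep_rate, seed=0):
    """Keep all attacks and a random share of legitimate emails.

    Each kept legitimate email gets weight 1 / legit_keep_rate so that the
    weighted class totals still approximate the original traffic mix.
    """
    rng = random.Random(seed)  # seeded for reproducible experiments
    sample = []
    for email in emails:
        if email["is_attack"]:
            sample.append({**email, "weight": 1.0})
        elif rng.random() < legit_keep_rate:
            sample.append({**email, "weight": 1.0 / legit_keep_rate})
    return sample
```

Training on the smaller sample reduces compute, while the weights let a learner that supports per-example weights account for the legitimate emails that were left out.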
These experiments laid the groundwork for our current approach, where the combination of algorithm-powered rule creation and human intelligence enables us to tackle the complexity and scale of modern email threats.
Our Hybrid Approach to Better Protect Our Customers
Abnormal’s advanced email threat detection systems leverage a fusion of heuristic techniques and machine learning algorithms to accurately detect and neutralize malicious content. By implementing decision tree models, we've improved our system's ability to generalize from diverse threat signatures, ensuring high detection accuracy and computational efficiency.
This is just one example of our hybrid approach, which combines AI-driven automation with expert domain knowledge to enhance our security infrastructure and streamline operational processes. We’re proud of our robust, scalable framework that fortifies our defenses and offers enhanced protection for our customers against sophisticated email threats.
As a fast-growing company, we have lots of interesting engineering challenges to solve, just like this one. If these challenges interest you, and you want to further your growth as an engineer, we’re hiring! Learn more at our careers website.