At Abnormal, the problems we are trying to solve are not that much different from those being tackled by other organizations, including non-startups. What is unique to startups is the additional set of constraints placed on the solution space, such as how much time, money, and engineering effort can be thrown at a problem. In this series, we will explore some of the challenges we face and the unique ways we have overcome them.
We were recently faced with the challenge of scaling our online KV (key-value, a.k.a. NoSQL) datasets, which had become one of the biggest factors limiting our ability to take on additional traffic from new customers. This was primarily driven by two datasets: Behavior DB and Mailbox DB (there is a third, Person DB, but that is a story for another day). The former, as its name suggests, contains statistics about the behavior associated with all messages sent to or from an organization (and the individuals involved). The latter contains metadata about each mailbox (i.e., email account). Both datasets grow linearly with the number of mailboxes being protected; Behavior DB additionally grows with the average number of data points generated per message.
We knew we would eventually have to face this issue but made the conscious decision to delay investing in a solution until it became more urgent. That moment arrived when we had only enough runway to increase the number of mailboxes being protected by approximately 25%, against a projected mailbox growth of 100% over the following two months.
Before we start, it would be helpful to have some shared context. If we zoom all the way out, at the heart of the real-time component of our email protection product is the Real-Time Scorer (more affectionately known as the RT Scorer).
When an email message is delivered, it is first sent to the RT Scorer to generate a judgment (e.g., safe, suspicious, malicious) and is then sent along to the Post-Scoring Processor to take action based on that judgment (e.g., quarantining the message if it is malicious, injecting a banner into the body of the message if it is suspicious, surfacing the threat in the customer portal if it is an unsafe message). The RT Scorer extracts metadata about the message (e.g., sender, recipient, email headers), collects data related to the entities of the message (e.g., sender, recipient, attachments), and then passes that data on to the various detectors to produce a judgment. The RT Scorer is a stateless service running in a Docker container. It scales horizontally with message volume by adding more instances; in theory, it should be infinitely scalable.
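To make that flow concrete, here is a minimal Python sketch of the two stages. Every name here (the Judgment enum, DETECTORS, the function signatures) is an illustrative stand-in, not our actual code:

```python
from enum import Enum

class Judgment(Enum):
    SAFE = 0
    SUSPICIOUS = 1
    MALICIOUS = 2

# Stand-in detector; a real deployment runs many, far richer detectors.
DETECTORS = [lambda metadata, entities: Judgment.SAFE]

def score_message(message: dict) -> Judgment:
    """RT Scorer: extract metadata, gather entity data, run the detectors."""
    metadata = {"sender": message["sender"], "recipient": message["recipient"]}
    # In production, this step looks up Behavior DB and Mailbox DB.
    entities = {"sender_stats": {}, "mailbox_info": {}}
    verdicts = [detector(metadata, entities) for detector in DETECTORS]
    # Illustrative policy: the most severe verdict wins.
    return max(verdicts, key=lambda judgment: judgment.value)

def post_scoring_processor(judgment: Judgment) -> str:
    """Act on the judgment: quarantine, inject a banner, or deliver."""
    return {
        Judgment.MALICIOUS: "quarantine",
        Judgment.SUSPICIOUS: "inject banner",
        Judgment.SAFE: "deliver",
    }[judgment]
```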
In the early days, Behavior DB and Mailbox DB were both small. We had a dozen or so instances processing double-digit queries per second (QPS) protecting approximately ten thousand mailboxes spread across a handful of customers. Behavior DB contained about 5 million keys, which fit nicely into a 100 MB database file. Mailbox DB was smaller and could easily fit into 20 MB. These datasets were (and still are) generated in batch (using Spark) and are updated daily.
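Conceptually, building one of these datasets looks something like the following PySpark sketch. The input path, schema, and aggregations are hypothetical stand-ins, not our real pipeline:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("behavior-db-build").getOrCreate()

# Hypothetical input: one row per scored message (path and schema are
# illustrative, not our production layout).
messages = spark.read.parquet("s3://example-bucket/messages/")

# Materialize per-(sender, recipient) behavior statistics, keyed the way
# the online store will look them up.
behavior = (
    messages
    .groupBy("sender", "recipient")
    .agg(
        F.count("*").alias("message_count"),
        F.max("sent_at").alias("last_seen"),
    )
    .withColumn("key", F.concat_ws("|", "sender", "recipient"))
)

# A downstream step packs these rows into a RocksDB database file.
behavior.select("key", "message_count", "last_seen").write.mode(
    "overwrite").parquet("s3://example-bucket/behavior-db-staging/")
```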
Back then, we were faced with a difficult decision: how to make these datasets available to our detection system. Like many decisions, this one was about tradeoffs. In startup life, time is precious and these decisions must be made quickly; there generally isn't enough time to complete an in-depth evaluation of each viable option. Using familiar technologies (even when they are not a perfect fit), and using them in simple ways, brings velocity and a high chance of success. For the first version, we ultimately chose to keep the system simple and distribute the datasets to all RT Scorer servers so that they could access the data locally, without the complexity of adding a remote procedure call to what was a self-contained, monolithic service. This simplicity and speed came at the cost of technical debt: a system that would not scale well. Behavior DB was deployed as a RocksDB database and was accessed via a library, directly off disk. Mailbox DB was deployed as a compressed CSV file and was loaded into memory.
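In sketch form, the local access pattern looked roughly like this (using the python-rocksdb bindings; the paths and key format are illustrative):

```python
import csv
import gzip

import rocksdb  # python-rocksdb bindings (one of several options)

# Behavior DB: a RocksDB database opened read-only, straight off local disk.
behavior_db = rocksdb.DB(
    "/data/behavior_db",  # illustrative path
    rocksdb.Options(create_if_missing=False),
    read_only=True,
)
sender_stats = behavior_db.get(b"sender@example.com|recipient@example.com")

# Mailbox DB: a compressed CSV loaded fully into memory at startup.
mailbox_db = {}
with gzip.open("/data/mailbox_db.csv.gz", "rt") as f:  # illustrative path
    for row in csv.DictReader(f):
        mailbox_db[row["mailbox_id"]] = row
```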
As our detection system became more sophisticated, the number of keys per message being materialized in Behavior DB grew, and so did its size. As the number of mailboxes we were protecting grew, so did the size of both datasets. By May 2020, both Behavior DB and Mailbox DB had grown by two orders of magnitude. The process of loading, unpacking, and opening these datasets was now taking more than 10 minutes and consumed a considerable portion of memory. Since the RT Scorer could not function without these datasets, this severely impacted our ability to launch new instances: auto-scaling became sluggish, deploys that used to take 10 minutes were taking over an hour, and S3 transfer costs were inflated. Mailbox DB would soon exhaust the available memory on the RT Scorer, and increasing it (because of some fundamental technical details of how the RT Scorer is built) would have meant running the servers with twice the memory and CPU, at twice the cost, without being able to utilize the additional compute power.
We were faced with a problem. We were projecting the number of mailboxes to grow by a factor of five by the end of the year (which turned out to be spot on) and then by another factor of ten by the end of the following year. More urgently, the mailbox count was set to double in the next two months.
We faced a number of challenges, and our solution had to meet several hard requirements.
Given these requirements and restrictions, we looked at our options. What follows are the most promising ones:
Option 1: Partition RT Scorer by customer and scale horizontally by adding more clusters as more customers are added
Option 2: Scale the RT Scorer vertically by adding bigger disks, adding more memory, and caching datasets locally
Option 3: Publish the datasets to a centralized KV store and fetch the data remotely

Option 3 seemed the most promising. However, it would introduce some risk.
Is there a way to get the best of both worlds? Could we take our existing RocksDB datasets and serve them from a centralized service? If so, this would eliminate much of the risk.
The approach we took was to use the current RocksDB dataset and serve it using rocksdb-server. rocksdb-server is a simple service that exposes a Redis-compatible API on top of the RocksDB API. It serves a single database at a time and implements a tiny subset of the supported Redis commands. This made it a great fit for our use case: we could keep our existing RocksDB datasets exactly as they were and serve them over a simple, well-understood protocol.
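Because the API is Redis-compatible, a lookup against rocksdb-server can be made with any off-the-shelf Redis client. Here is a minimal sketch using redis-py, with a hypothetical endpoint and key:

```python
import redis

# rocksdb-server speaks enough of the Redis protocol for simple reads, so an
# off-the-shelf client works unchanged. Host, port, and key are hypothetical.
client = redis.Redis(host="behavior-db.internal", port=6379)
sender_stats = client.get("sender@example.com|recipient@example.com")
```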
With these new servers in place, the RT Scorer now fetches Behavior DB and Mailbox DB over the network from dedicated rocksdb-server instances instead of reading them from local disk.
Introducing changes to any system comes with risk. In general, the more significant the change, the higher the risk. There are several patterns that can be followed to help de-risk any project. In this case, we chose the following strategies:
De-risking the high-risk items
High-risk items can upend an otherwise sound solution, and waiting too long to investigate them can jeopardize its delivery. For this solution, there were a few things that we didn't know or were worried about. Most of our worries were around rocksdb-server, since we had no experience with it, particularly its performance under production load.
The approach we took to de-risk this solution was to first create a docker image that would download an arbitrary dataset from S3 and launch rocksdb-server pointing to that dataset. Once this was done, it became possible to test all of the performance-related risks. More on this later.
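In sketch form, the entrypoint of such an image might look like the following (the bucket layout and the server's command-line interface are assumptions, not the exact image we built):

```python
import os
import sys

import boto3

def main() -> None:
    # Hypothetical arguments: the S3 bucket/prefix holding the RocksDB files.
    bucket_name, prefix = sys.argv[1], sys.argv[2]
    local_dir = "/data/db"

    # Pull every file of the dataset down from S3.
    bucket = boto3.resource("s3").Bucket(bucket_name)
    for obj in bucket.objects.filter(Prefix=prefix):
        if obj.key.endswith("/"):  # skip directory markers
            continue
        target = os.path.join(local_dir, os.path.relpath(obj.key, prefix))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        bucket.download_file(obj.key, target)

    # Replace this process with rocksdb-server pointing at the dataset.
    # (The exact CLI of rocksdb-server may differ; this is an assumption.)
    os.execvp("rocksdb-server", ["rocksdb-server", local_dir])

if __name__ == "__main__":
    main()
```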
Crawl/Walk/Run is an approach to delivering value that encourages early successes, and it proved itself in this project. Thinking this way helps you focus on small improvements that deliver value without having to build the perfect solution right away: build something small, learn, and improve over time. To solve our problem, all we had to do was crawl; we didn't need to run.
For the crawl, we were able to sacrifice functionality that would be needed in the future but was not needed yet.
Sure, without this functionality our solution is very rough around the edges, but it’s solving a big problem.
Even with the most thorough testing, it's hard to say with absolute certainty that an implementation will not break things. This is especially true when replacing one already-working thing (that others are depending on) with something else. A common technique to mitigate this risk is the dark launch. In a dark launch, the new solution is exercised to its fullest, but its results are not used. Typically, it's controlled by a knob that can send anywhere from 0% to 100% of traffic through the new solution.
In the case of Behavior DB and Mailbox DB, the dark launch was controlled by dynamic configuration that indicated the percentage of requests that should be asynchronously fetched from the secondary (remote) storage. A second flag controlled the percentage of requests that used the remote storage as primary instead of the local storage. With these two flags, we were able to slowly increase the traffic sent to the new rocksdb-server instances and observe how they performed, and then transition the remote storage from dark mode to live mode, all without having to redeploy any code. A rough sketch of this read path follows.
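In rough Python, the dual-read logic looked something like this (the flag values, store interfaces, and metrics API are illustrative):

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Dynamic configuration in production; hardcoded here for illustration.
DARK_READ_PCT = 5       # % of requests also fetched (async) from remote storage
REMOTE_PRIMARY_PCT = 0  # % of requests served directly from remote storage

_pool = ThreadPoolExecutor(max_workers=4)

def get_value(key, local_store, remote_store, metrics):
    """Read path with dark-launch knobs (all names here are illustrative)."""
    if random.uniform(0, 100) < REMOTE_PRIMARY_PCT:
        # Live mode: remote storage serves this request.
        return remote_store.get(key)

    value = local_store.get(key)
    if random.uniform(0, 100) < DARK_READ_PCT:
        # Dark mode: fetch from remote storage in the background and compare;
        # neither the result nor its latency affects the response.
        _pool.submit(_compare_with_remote, key, value, remote_store, metrics)
    return value

def _compare_with_remote(key, local_value, remote_store, metrics):
    remote_value = remote_store.get(key)
    metrics.increment(
        "kv.remote_match" if remote_value == local_value else "kv.remote_mismatch"
    )
```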
By adjusting these flags and watching the comparison and latency metrics of the remote storage, we were able to verify that the new remote storage provided byte-for-byte identical responses to the existing local storage and that its latency was within specification, all without any negative impact on production traffic.
The rollout was smooth. The instrumentation we added uncovered issues that we were able to fix before going live; some of these fixes came in the form of improvements to rocksdb-server itself (which can be found on the scale-up branch of this fork).
We experienced two main issues during dark launch: excessive latency and discrepancies between old and new storage.
The source of the discrepancies was an interesting problem to track down (and is worthy of being covered in greater depth in a follow-up post); in short, the problems we encountered had two main causes. Tracking down the latency issues was a similarly interesting problem to solve (also worthy of a follow-up post); briefly, the added latency traced back to several distinct sources.
Overall, I think this was a very interesting and successful project. We were able to swap out the database underneath our application without any impact on production, reduce startup time from 10+ minutes to 30 seconds, and buy ourselves enough runway to improve the “Crawl” and to focus on solving other problems.
Behavior DB and Mailbox DB have each grown 3x in size since we started working on our read-only KV Store server six months ago. Although they are still holding up, there are some short- and long-term improvements we would like to make.
Beyond improvements to how we serve batch data, there is still a ton more to do to help scale the business! We need to figure out how to efficiently scale the RT Scorer 10x and beyond. As a force multiplier for the organization, the infrastructure/platform team needs to figure out which common design patterns to provide to engineers (e.g., data processing, storage solutions, etc.) and how to better support a growing engineering team by making it easier to build, test, deploy, and monitor low-latency, high-throughput services, so engineers can spend more time solving customer problems and less time deploying and maintaining services. If these problems interest you, yes, we're hiring!