Abnormal Adventures with Scaling Out Redis
One of the ways we detect attacks at Abnormal is by keeping track of how many times we’ve observed certain features, or combinations of features, appear in messages. For example, how many times we’ve seen a message from johndoe[@]customer.com, or how many times we’ve seen johndoe[@]customer.com send a message to jane[@]customer.com.
We refer to each occurrence as an “event.” By maintaining counts of how many times we’ve seen similar events in the past, and also how many times those events have been associated with attacks, we can use the ratio of attack messages to safe messages to evaluate the likelihood that a new message with a given event may be an attack as well.
At Abnormal, we use Redis to keep track of these counts. Redis is an in-memory key-value database that provides lightning-fast reads and writes, and is especially well-suited for keeping track of the count data we rely on at Abnormal.
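To make this concrete, here is a minimal sketch of the counting pattern using the redis-py client; the key format and field names are hypothetical, not our production schema:

```python
import redis

# Hypothetical sketch of the counting idea; the key format and field names are
# illustrative, not our actual schema.
r = redis.Redis(host="localhost", port=6379)

def record_event(event: str, is_attack: bool) -> None:
    # Keep one counter for total observations and one for attack observations.
    r.hincrby(f"event:{event}", "total", 1)
    if is_attack:
        r.hincrby(f"event:{event}", "attack", 1)

def attack_ratio(event: str) -> float:
    counts = r.hgetall(f"event:{event}")
    total = int(counts.get(b"total", 0))
    attacks = int(counts.get(b"attack", 0))
    return attacks / total if total else 0.0

record_event("johndoe@customer.com->jane@customer.com", is_attack=False)
print(attack_ratio("johndoe@customer.com->jane@customer.com"))
```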
Background
The main reason Redis operations are so fast is that Redis stores all of its data in memory instead of on disk, so fetching the value for a given key never incurs the additional latency of disk I/O. The tradeoff is that the dataset stored on a Redis server needs to be small enough to fit entirely in memory.
For example, if I run a Redis server on my laptop with 16GB of RAM, the dataset I store on that server can be no larger than 16GB (and typically smaller in practical applications). As such, there’s a limit to the total amount of data you can store for a given machine.
As we rapidly scaled our customer base, the size of our event datasets grew as well. These datasets are stored on AWS-managed Elasticache, which offers various machine instance configurations ranging from 0.5 GB to 636 GB of memory. Each time the size of our dataset began approaching the memory limit of the Elasticache instance storing the dataset, we would migrate the dataset to the next largest instance. This is also known as “vertical” scaling.
Unfortunately, vertically scaling a Redis server has its limits, and you can only store as much data as the largest instance offers. In this case, that’s 636 GB. As of a few months ago, we found ourselves rapidly approaching this limit; our largest counts dataset was over 400 GB and growing. With our rapid growth, we were on track to hit the 636 GB limit in two months, so we needed to figure out a way to scale beyond this—and fast!
After doing some research, we discovered we had several options to handle more event data, and each had its own tradeoffs.
Option 1: Cut Down Our Key Size
We use a 256-bit hash of an event, represented in hex as a 64-character string, as the keys for our Redis dataset. In a Redis server, each character takes up one byte, so each of these keys took up 64 bytes plus a 56-byte overhead Redis allocates per key, for a total of 120 bytes per key.
Our values are integers or maps of integers to integers, and, on average, our keys (including the overhead) had roughly the same memory footprint as our values. In short, we were consuming 25% of the available memory in our database—just for the key strings!
One option we had was to use the hex representation of a 128-bit hash instead, which still provides a large enough mapping range to avoid hash collisions between keys. The 128-bit hash would produce a 32-character string and reduce our per-key memory footprint from 120 bytes down to 88 bytes, enough to support roughly 15% more key-value pairs on top of our previous limit.
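Here's a rough sketch of the difference; the specific hash functions are stand-ins (SHA-256 for the 256-bit hash, BLAKE2b truncated to 128 bits), not necessarily what we use in production:

```python
import hashlib

event = "johndoe@customer.com->jane@customer.com"  # illustrative event encoding

# 256-bit hash -> 64 hex characters -> 64 bytes of key string,
# plus the ~56-byte per-key overhead = ~120 bytes per key.
key_256 = hashlib.sha256(event.encode()).hexdigest()

# 128-bit hash -> 32 hex characters -> ~88 bytes per key including overhead.
key_128 = hashlib.blake2b(event.encode(), digest_size=16).hexdigest()

print(len(key_256), len(key_128))  # 64, 32

# With keys and values roughly the same size, per-pair usage drops from
# ~240 bytes to ~208 bytes, i.e. roughly 15% more pairs in the same memory.
print(240 / 208)  # ~1.15
```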
This would be easy to implement, but at our rate of growth, it only equated to an additional two weeks of runway, which was definitely not enough! We needed to explore more drastic infrastructure changes, and the two most promising options were Redis on Flash and Redis Cluster.
Option 2: Redis on Flash, a.k.a. Data-tiering
Redis on Flash is a recently-released feature that uses a hybrid in-memory + SSD approach to store more data on a Redis server than the available RAM on the instance. This allows for significantly increased data storage capacity with lower resource requirements and can help reduce costs for the right use cases.
AWS Elasticache has a managed Redis on Flash offering referred to as “data tiering”. It requires migrating your cache to the latest cache.r6gd instance class, but the largest data-tiered instance + SSD combination provides nearly 2 TB of storage—enough to extend our runway by at least 6 months! Since this was a vertical scaling operation similar to those we had performed in the past, we could use the same Redis client and wouldn’t need to worry about any of the complexities introduced by horizontal scaling.
How Does Redis on Flash Work?
Rather than storing the entire dataset in memory, Redis on Flash only stores the keys, the Redis dictionary used for key-value mappings, and a small subset of “hot-values” in memory. It uses an asynchronous process running an LRU algorithm to move “lesser-used” values from memory onto the SSD, freeing up memory to store additional keys.
When the server receives an operation for a key with a value residing on disk, the asynchronous process loads the value back into memory and handles the request as normal. The disk read is a fairly fast operation, but still introduces a fair amount of latency overhead compared to the lightning-fast in-memory lookups. As such, Redis on Flash is recommended for use cases where 20% of the “dataset” is accessed 80% of the time. (We’ll go into more detail about how the wording around this came back to bite us later.)
The upshot of this access pattern is that the most frequently accessed items in the dataset can be served directly from memory, meaning 80% of operations don't require a disk read, which keeps overall read/write performance almost on par with a dataset served entirely from memory.
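As a mental model (a toy simplification, not how Redis on Flash is actually implemented), the tiering behavior looks roughly like this: keys always stay in memory, and only the least-recently-used values get pushed down to the slower tier.

```python
from collections import OrderedDict

class ToyTieredStore:
    """Toy model of the data-tiering idea: keys always stay in memory, while the
    least-recently-used *values* are pushed down to a slower tier (a plain dict
    standing in for the SSD). Not how Redis on Flash is actually implemented."""

    def __init__(self, max_hot_values: int):
        self.hot = OrderedDict()  # in-memory values, kept in LRU order
        self.cold = {}            # stand-in for the SSD tier
        self.max_hot_values = max_hot_values

    def set(self, key, value):
        self.cold.pop(key, None)
        self.hot[key] = value
        self.hot.move_to_end(key)
        # Evict least-recently-used values (never keys) to the cold tier.
        while len(self.hot) > self.max_hot_values:
            lru_key, lru_value = self.hot.popitem(last=False)
            self.cold[lru_key] = lru_value

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        if key in self.cold:
            # The slow path: a "disk read", then promote the value back to memory.
            value = self.cold.pop(key)
            self.set(key, value)
            return value
        return None
```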
Will It Work for Us?
With this information in hand, we needed to determine whether our dataset's access pattern was aligned with the requirements prescribed by the Redis on Flash documentation. By running some batch analysis jobs on our real-time processing logs, we determined that roughly 20% of our keyspace accounted for about 80% of accesses, which at first glance appeared to match the recommended pattern.
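The analysis itself is conceptually simple. A stripped-down sketch of the kind of calculation we ran (our real version was a set of batch jobs over production logs) might look like this:

```python
from collections import Counter

def keyspace_fraction_covering(accessed_keys, target=0.80):
    """Return the fraction of distinct keys that accounts for `target` of all
    accesses. A stripped-down stand-in for the batch jobs we ran over logs."""
    counts = Counter(accessed_keys)
    total_accesses = sum(counts.values())
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total_accesses >= target:
            return rank / len(counts)
    return 1.0

# Example with a skewed access log: a handful of hot keys dominate.
log = ["k1"] * 50 + ["k2"] * 30 + ["k3", "k4", "k5", "k6"] * 5
print(keyspace_fraction_covering(log))  # ~0.33 of keys cover 80% of accesses
```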
Our Data-tiering Adventure
We went about setting up a new Elasticache cluster with the data-tiering feature enabled, and once everything was up and running, we started directing a mirrored copy of production write operations at the new cluster.
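Conceptually, mirroring just means issuing each production write against both clusters, with the shadow write never allowed to affect the production path. A minimal sketch, with hypothetical hostnames:

```python
import logging
import redis

primary = redis.Redis(host="counts-prod.internal")          # hypothetical hostnames
shadow = redis.Redis(host="counts-data-tiering.internal")

def mirrored_hincrby(name: str, field: str, amount: int = 1) -> None:
    # The production write always goes through first.
    primary.hincrby(name, field, amount)
    try:
        # The shadow write must never be able to break the production path.
        shadow.hincrby(name, field, amount)
    except redis.RedisError:
        logging.exception("shadow write to data-tiering cluster failed")
```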
For the first few days, everything seemed to be running smoothly. Operation latency was on par with our standard Redis Elasticache instance, and items were stored in memory until memory was fully consumed, at which point values were spilled to disk.
Then Monday morning hit and everything went south:
Write latency spiked along with disk I/O, client operations began timing out at increasingly high rates, and the replication lag for our Elasticache instance began climbing. As a result, our writes were either incredibly slow or timing out and failing, meaning we were losing count data, and any reads from the Elasticache replica instances were likely to be out of date from the replication lag. We hopped on a call with AWS to figure out what was going on.
What We Found Out
There were a couple of key details we initially missed, or that were only made clear in the Redis on Flash docs and not in the AWS data-tiering docs.
The first was that Redis on Flash still keeps all keys in memory, and the LRU algorithm will only spill the values to disk. It’s best used when the average size of keys is smaller than the average size of values. As mentioned before, our keys and values are roughly the same size on average.
Furthermore, the best practice requirement that 80% of your operations cover only 20% of your “dataset” is referring to 20% of the dataset volume, including both keys and values, not 20% of the dataset keyspace. The nature of our dataset was such that the more frequently a key was accessed, the larger its corresponding value was, and we had a long tail of keys with values that were only a few bytes in total.
While we were accessing 20% of our keyspace 80% of the time, that 20% accounted for over 50% of our total dataset volume, since those keys also had the largest values. Redis on Flash is actually optimal for use cases where the largest values are the least frequently used and can therefore be spilled to disk. That's the complete opposite of our dataset, where only the smallest values were candidates for spilling.
The Monday morning disaster we were seeing was a result of this poor fit: our keys were taking up all available memory, and so the Redis instance was spilling all the values to disk. We had to fetch the values from disk for every operation.
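A back-of-the-envelope calculation with made-up numbers shows why this falls apart when keys and values are roughly the same size:

```python
# Made-up numbers purely for illustration, not our production figures.
instance_memory_gb = 400   # hypothetical RAM on the data-tiered instance
dataset_gb = 800           # hypothetical total dataset size (keys + values)
key_share = 0.5            # our keys and values were roughly the same size

keys_gb = dataset_gb * key_share                       # 400 GB of keys, all pinned in RAM
room_for_hot_values_gb = instance_memory_gb - keys_gb  # ~0 GB left for values

# With no room left for hot values, nearly every value ends up on the SSD,
# so nearly every operation pays the disk-read penalty.
print(keys_gb, room_for_hot_values_gb)
```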
The Elasticache instances were able to sustain performance despite this additional overhead during lower-traffic weekends, but when our traffic increased on Monday morning, the combination of more write operations on top of the additional disk fetches was enough to overwhelm the CPU. The Redis server didn't have enough CPU capacity to serve all requests, let alone sustain the background replication process that kept our read replicas up to date.
Redis on Flash would not work out for us after all.
Option 3: Redis Cluster Mode
After our initial misfire with Redis on Flash, we realized we had no choice but to scale horizontally with Redis Cluster. Horizontal scaling means the dataset is split into chunks, typically referred to as "shards", and each chunk is managed by a separate Redis server. The entire dataset no longer needs to fit in the memory of a single instance and can instead be spread across dozens of instances.
Horizontal scaling with Redis Cluster significantly increases the size limit for a single Redis dataset, but it comes with a few caveats that gave us pause and led us to try the data-tiering approach first.
The first caveat is that Redis Cluster requires a cluster-aware client with built-in logic to determine which shard a given key-value pair belongs to and which server stores that shard. Fortunately for us, this client is offered both by the standard redis-py library and by the standalone redis-py-cluster library that has since been merged into redis-py, so this didn't pose too much of a challenge. Even so, we would have to swap in the new client library, which is not without risk.
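In recent versions of redis-py, the cluster client is close to a drop-in replacement. A minimal sketch, with a hypothetical hostname:

```python
from redis.cluster import RedisCluster

# Connect to any node; the client discovers the cluster topology and routes each
# command to the shard that owns the key's hash slot. Hostname is hypothetical.
rc = RedisCluster(host="counts-cluster.internal", port=6379)

rc.hincrby("some-event-key", "total", 1)
print(rc.hgetall("some-event-key"))
```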
The other potential issue we foresaw with using Redis Cluster was our reliance on Redis pipelining. We use pipelining to keep our per-message processing time low, but pipelining gets tricky when you introduce sharding.
Fortunately, the Redis cluster clients had a clever way of handling pipelining that came to the rescue. The client has logic to break the pipelined request up into subrequests containing all the keys for a given shard, and then execute those requests effectively in parallel. We could maintain pipelining and use Redis Cluster after all!
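From the caller's side, the pipelining code barely changes; the per-shard splitting happens inside the client. Roughly:

```python
from redis.cluster import RedisCluster

rc = RedisCluster(host="counts-cluster.internal", port=6379)  # hypothetical hostname

# Same pipeline interface as before; under the hood the client groups the queued
# commands by the shard that owns each key, sends the per-shard sub-batches, and
# stitches the replies back into a single, ordered result list.
pipe = rc.pipeline()
for key in ["event:a", "event:b", "event:c"]:
    pipe.hgetall(key)
results = pipe.execute()
```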
We set up a new Elasticache cluster with Cluster Mode (the AWS-managed Redis Cluster offering) enabled and backfilled our counts data onto the new instance. After some benchmarking to ensure the performance metrics of Redis Cluster still met our requirements, we redirected our production reads and writes to the new cluster.
Solution and Key Findings
We had finally found a solution to our scaling problem. Here are a few key learnings from our misadventures with Redis on Flash:
For open-source projects like Redis, if you’re planning on using a managed service like Elasticache, try to find the original open-source documentation for any new features in the managed service. They’re often more explicit about the potential pitfalls of the feature and less inclined to “sell” you on the new functionality.
Set up a subscaled, end-to-end replica of your production environment (sometimes referred to as a “tracer-bullet”) to validate the viability of a new system before investing time in long-term best practices like infrastructure-as-code or fully building out individual components. Try to have this environment mirror expected production traffic load, scaled relative to the size of your replica.
Try to understand the inner workings of a new system as best you can before making any determinations about whether it will be a good fit for your use case. This deeper understanding will likely save you from trouble in the short run and unlock hidden opportunities for optimization in the long run.
Redis Cluster would give us the ability to scale significantly beyond our former 636 GB limit, with enough runway for at least a few years, while still maintaining the low-latency operation performance our system requires. Each time we approach the storage limit of our Redis cluster, we can add an additional shard and corresponding node to increase our capacity by another 636 GB.
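With Elasticache cluster mode, adding a shard is an online resharding operation. Using boto3 it looks roughly like the sketch below; the replication group ID and shard count are hypothetical, and the exact parameters are worth confirming against the Elasticache documentation:

```python
import boto3

elasticache = boto3.client("elasticache")

# Online resharding: bump the shard (node group) count by one and let Elasticache
# rebalance hash slots onto the new shard in the background.
elasticache.modify_replication_group_shard_configuration(
    ReplicationGroupId="counts-cluster",  # hypothetical identifier
    NodeGroupCount=4,                     # previous shard count + 1
    ApplyImmediately=True,
)
```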
Redis Cluster Mode supports up to 500 shards, for a theoretical* limit of 340 TB of data, which is over a 500x increase in total capacity. (*We'd run into other system scalability problems long before reaching this theoretical limit.) We took a moment to "stretch our legs" and enjoy the massive increase in storage capacity.
If these problems interest you, join us! Check out our careers page for open roles and to apply.
Bonus: Redis Pipelining
For each message we process through our scoring engine, we need to execute reads from our Elasticache cluster for potentially dozens of keys. If we were to execute those serially, sending each request and waiting for the server response, this would significantly increase the time it takes to “hydrate” each message with the counts from Redis, and in turn the overall processing time for each message.
For Abnormal, these delays increase the time our customers are exposed to email attacks—an unacceptable outcome for both Abnormal and our customers. So how can we make all these Redis lookups without causing processing delays? For that, we turn to Redis pipelining.
Pipelining is functionality offered by Redis client libraries that allows client-side "batching" of Redis operations. Each operation is added to a "pipeline" and enqueued client-side; once all the operations have been added, the user "executes" the pipeline. This sends all of the queued operations to the Redis server in a single request; the server executes them and returns the results in a single response.
This cuts the total number of client-server network round trips from one per count required by a message to just one per message, which can significantly cut down the per-message processing time. For example, if the network round-trip time is 50ms and each message requires 100 count features, then the total lookup time per message is 50ms with pipelining instead of 5 seconds without, a significant difference!
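With redis-py, the pattern looks roughly like this (keys and counts are illustrative):

```python
import redis

r = redis.Redis(host="localhost", port=6379)
feature_keys = [f"event:feature:{i}" for i in range(100)]  # illustrative keys

# Without pipelining: one network round trip per key.
# counts = [r.hgetall(key) for key in feature_keys]

# With pipelining: queue all 100 reads client-side, then one round trip.
pipe = r.pipeline(transaction=False)
for key in feature_keys:
    pipe.hgetall(key)
counts = pipe.execute()  # 100 results, in the same order they were queued
```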
Things get trickier with Redis Cluster. Since a pipeline can be composed of operations on a group of different keys, there is no guarantee that all of those keys reside on the same Redis server. As such, if we send a pipelined request to a single Redis Cluster server when some of its keys live on other servers, the server will throw an error.
We could avoid this by breaking our pipelined requests back down into serial, single-operation requests, but that would significantly impair processing time. We could write additional client logic to parallelize these requests, but that would require a significant time investment, and time was something we didn't have to spare. Fortunately, as described above, the Redis Python cluster client already implements a clever way of handling pipelining that addresses this problem.
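Under the hood, that clever handling boils down to bucketing the queued keys by the hash slot, and therefore the shard, that owns each one. Here's a simplified sketch of the idea, ignoring hash tags and the actual request dispatch:

```python
import binascii
from collections import defaultdict

HASH_SLOTS = 16384

def key_slot(key: str) -> int:
    # Redis Cluster maps each key to one of 16384 hash slots using CRC16
    # (CRC-16/XMODEM, which binascii.crc_hqx computes with an initial value of 0).
    # This simplified version ignores the "{hash tag}" rule.
    return binascii.crc_hqx(key.encode(), 0) % HASH_SLOTS

def group_keys_by_slot(keys):
    """What a cluster-aware pipeline does before sending anything: bucket the
    queued keys so each shard receives a single sub-request for its slots."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[key_slot(key)].append(key)
    return buckets

print(group_keys_by_slot(["event:a", "event:b", "event:c"]))
```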