How We Overhauled Our Architecture to Handle 100x Scale: Python to Golang, Celery to Kafka
As a rapidly growing email security startup, we recognize just how essential flexibility and adaptability are to our success. Protecting an ever-increasing number of organizations from advanced threats like business email compromise (classified by the FBI as one of the most financially damaging cybercrimes) depends not only on our ability to scale properly but also on our ability to troubleshoot growing pains.
To keep employee inboxes secure, Abnormal Security processes all mail to filter out undesirable messages—including, but not limited to, phishing, spoofing, and malware. While we are certainly excited to see that our customer base is expanding (since it reflects the increasing demand for our solution), this growth also creates challenges. In particular, it has vastly increased the volume of mail we go through daily across our services, leading to scaling issues and frequent outages.
This was especially significant in one service, Unwanted Mail, which filters all mail to determine actions to take on messages we detect as unwanted, such as spam and promotional mail.
In part one of this blog post, we share why we decided to overhaul the Unwanted Mail service, which was originally built with Python and Celery as its broker. In part two, we will discuss our process and the benefits we reaped.
Unwanted Mail Service
At a high level, the Unwanted Mail service (which we’ll refer to as “UM”) evaluates incoming mail to identify and take action on spam and promotional mail. Over time, the service infers user preferences by observing how users interact with these messages, intelligently building safelists and blocklists for future mail.
UM processes both safe mail and unwanted mail to output message movement decisions. Given that an average person receives 100-120 emails a day, and Abnormal’s customers have an average of approximately 1000 mailboxes, every new customer onboarded to Abnormal adds on the order of 100,000 messages a day to the UM service’s load.
The chart below shows how UM grew over the first six months of 2022, measured as requests received over a two-week period. We started to experience scaling problems with our architecture at around the 30 million mark, as shown below.
Why Our Current Solution Just Wasn’t Working
“An outage a day keeps customers away.”
Python and Celery are an integral part of Abnormal services. All of our products, including UM, were built and maintained on the asynchronous task model Celery provides. The diagram below shows the original UM architecture.
In the early startup days, Python and Celery were great in that they allowed us to build products and iterate quickly. However, as the company grew, our engineering vision shifted from building fast to building scalable solutions fast.
We found that our current architecture could not meet our needs for the following reasons:
Due to high resource overhead, we could not have a high number of concurrent workers per task. To increase processing parallelism, we had to increase the number of tasks in our cluster, which was extremely costly and not sustainable.
As we reached task scaling limitations, our Celery peak queue length grew longer while traffic increased. Celery’s Redis broker began hitting memory capacity, at which point it denied enqueue requests entirely. Although we had increased the Redis cache size twice in an attempt to alleviate this problem, we knew we were going to reach the end of our runway very soon.
The Redis broker backing Celery was a single point of failure: if it ever went down, the whole queue went “poof”. And because various services shared the same broker, long processing times on other services also reduced the availability of our own.
Redis is an in-memory cache. If the Redis broker goes down completely, the queued tasks are lost and will never be processed. While we had instrumented solutions as workarounds, these were unsustainable long term.
Celery workers were experiencing mysterious minor memory leaks and needed to be frequently rotated. This not only increased the amount of maintenance effort required but also meant that we had to minimize concurrency to free up memory before the leaks degraded task performance.
Furthermore, as this architecture could not handle the volume, we had to preprocess messages from within our upstream Notifications Processor system before they were sent to separate async Python workers via Celery. This meant that business logic was spread throughout the codebase, decreasing maintainability and complicating monitoring. Python and Celery were great, but we needed to find a better long-term solution.
Challenges in Choosing (and Implementing) a New Solution
“It can’t be that easy, right?”
With the problem properly framed, we now had to research and implement a better solution that would last. There were several challenges associated with rewriting this service, especially a critical one like UM:
We had a new feature coming that would at least double the current traffic
We were still growing fast and needed a solution that would be able to hit the ground running
We wanted a solution with longevity, not just a patch on the existing system that would buy us three months when we needed a design that would last three years
It was architecturally impossible to meet traffic and reliability demands with the current design without also overhauling upstream systems. Even with tweaks and optimizations on Celery, the Redis broker, and our consumer tasks, we weren’t going to scale up to projected traffic by the end of the year.
All things considered, the signs pointed to the urgent need for a rewrite before we hit the catastrophic failure point—even if rewriting the service was not a product priority.
Why Golang + Kafka
“A step towards a brave new world.”
The new design we arrived at is represented in the diagram below. In the new architecture, we have replaced Celery with Kafka as our broker and rewritten our workers in Golang.
Benefits of Golang
There were several benefits of using Golang over Python for our production service. Here are the five biggest ones, ordered by impact and importance:
Low concurrency overhead thanks to goroutines and channels. This means we can have a much larger number of concurrent workers per task (spoiler alert: we are comfortably at 50x) and can increase message consumption throughput without increasing cluster size.
Low deployment overhead. This is thanks to short build times, small binary size, and much faster, more reliable dependency resolution.
Static typing. Because the types of data passed through and processed in a code path are explicit, type errors are caught at compile time, protecting the code from a large class of runtime errors.
Native pooling support for Redis and SQL clients. This allows us to manage connection pools efficiently.
Speed. Golang code just runs faster than Python (given the same business logic). Small gains in per-message processing time compound into large gains for the whole service as QPS increases.
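To make the concurrency point above concrete, here is a minimal sketch of the goroutine-and-channel fan-out pattern that makes a high worker count per task cheap in Go. The `Message` and `Verdict` types and the classification rule are invented for illustration; they are not UM’s actual logic.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a hypothetical stand-in for an incoming mail event.
type Message struct {
	ID     int
	Sender string
}

// Verdict is the typed output of processing one message; with static
// typing, any mismatch here fails at compile time, not in production.
type Verdict struct {
	MessageID int
	Unwanted  bool
}

// process stands in for the real classification logic.
func process(m Message) Verdict {
	return Verdict{MessageID: m.ID, Unwanted: m.Sender == "spam@example.com"}
}

// consume fans work out to nWorkers goroutines reading from a shared
// channel and collects their verdicts. Goroutines are cheap enough that
// nWorkers can be far higher than a Celery worker count per task.
func consume(msgs <-chan Message, nWorkers int) []Verdict {
	out := make(chan Verdict)
	var wg sync.WaitGroup
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range msgs {
				out <- process(m)
			}
		}()
	}
	// Close the output channel once every worker has drained the input.
	go func() { wg.Wait(); close(out) }()

	var verdicts []Verdict
	for v := range out {
		verdicts = append(verdicts, v)
	}
	return verdicts
}

func main() {
	msgs := make(chan Message, 100)
	for i := 0; i < 100; i++ {
		msgs <- Message{ID: i, Sender: "user@example.com"}
	}
	close(msgs)
	fmt.Println(len(consume(msgs, 50))) // prints 100
}
```

In the real service the input channel is fed by a Kafka consumer rather than filled in `main`, but the fan-out shape is the same: raising `nWorkers` raises throughput without adding tasks to the cluster.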
Benefits of Kafka
We decided on Kafka for the following reasons:
High Queue Capacity
We did a preliminary test to quickly measure queue capacity by setting up our upstream system to produce 10x the amount of traffic we currently send to Celery into a topic on a Kafka cluster. In our initial test, we found that the Kafka cluster reached 200x capacity at approximately 2x the daily cost of the Redis broker behind Celery. This was as expected: Kafka brokers persist incoming messages to disk-backed storage that is orders of magnitude larger than the in-memory storage a Redis broker provides.
High Availability
Kafka’s fault tolerance comes at three levels.
Multiple brokers: Taking replication configuration out of the picture, even if one broker goes down, the partitions on the remaining brokers are still served, meaning the queue is never entirely dead.
Multiple AZs: If we get an AZ-level outage, there are still brokers residing in other AZs, so the entire cluster will never go down completely.
Replication: Bringing replication back into the picture, with replication configured properly, if a broker goes down, another broker holding an in-sync replica of each affected partition is naturally elected leader and continues processing messages in that partition. Set up correctly, this leaves minimal chance of outages, partial or otherwise, even when individual brokers do go down.
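As a sketch of what setting replication configuration properly involves (the topic name, partition count, and values below are illustrative, not our production settings), creating a fault-tolerant topic might look like:

```
# Illustrative only: 3 replicas per partition, and writes are only
# acknowledged once at least 2 in-sync replicas have them.
kafka-topics.sh --create --topic um-events \
  --partitions 50 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config unclean.leader.election.enable=false
```

Paired with `acks=all` on the producer side, a configuration like this means an acknowledged message survives the loss of any single broker, and an out-of-sync replica is never elected leader.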
Persistent Log Storage
Kafka has a persistent log store. Even if the entire cluster goes down due to some major disaster, when brokers come back online they can read from the persisted log storage and resume processing any unprocessed messages.
A Successfully Scaled Solution
With these considerations in place, the benefits of Golang and Kafka were pretty clear. We would like to thank Praveen Bathala for giving us advice and guidance for this project, without which this would not have been possible.
In the second part of this blog post, we will discuss how we effectively migrated the system to this new architecture without any downtime for customers and showcase how this significantly improved our systems.
As a fast-growing company, we have lots of interesting engineering problems to solve, just like this one! If these problems interest you, and you want to further your growth as an engineer, we’re hiring! Learn more at our careers website.