Saving Memory on a Python Database Server

December 7, 2021

Earlier this year, Dhurv Purushottam, our Data Platform lead, wrote about how we scaled up one of our most crucial, data-intensive services at low engineering cost. He described moving an in-memory dataset into its own microservice, an approach that proved effective given the constraints of money and (precious) engineering time.

At a hyper-growth startup, a solution from six months ago will unfortunately no longer scale. The business was growing rapidly, and traffic to this service in particular was growing at an unprecedented rate. We hit a point where it needed re-architecting to support 10x the current scale.

Because the service sits on the critical code path that protects our customers, we wanted to ensure it didn’t hit its resource capacity before the next generation of the service was designed and implemented. We had to buy some time by identifying, and reining in, the service’s high resource consumption.

About PersonDB

“First, a few details about this dataset: PersonDB. PersonDB contains data about our customers’ employees—the users that Abnormal protects. With each email that Abnormal processes, the sender and recipient information is matched against PersonDB, and whether or not there’s a match affects how we score that message. The score is a deciding factor in whether we consider the message a threat to the recipient.”
- Outgrowing Your In-Memory DB and Scaling with Microservices.

When we noticed PersonDB’s memory usage approaching capacity, the service was set up as a microservice, with multiple copies of the DB running in parallel.

Each instance of PersonDB is a Thrift-based Python server that loads uncompressed CSV data into internal Python data structures, resulting in a base memory usage of around 24GB of the available 30GB.

PersonDB is powered by Gunicorn, a WSGI HTTP server. Gunicorn uses a pre-fork worker model, which means a coordinator process forks and manages a set of worker processes that handle requests concurrently. Gunicorn allows PersonDB to process thousands of requests per second from real-time applications. Each client talks to a Gunicorn worker, which in turn queries the in-memory PersonDB data.
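As a rough sketch of this layout (the module shape, data format, and names here are our illustration, not PersonDB’s actual code), the dataset is loaded once at import time, before Gunicorn forks its workers, and every request handler only ever reads from it:

```python
import csv
import io

# Hypothetical stand-in for PersonDB's CSV dataset. In the real service this
# load happens once, in the Gunicorn coordinator, before workers are forked,
# so all workers inherit the same in-memory structures.
_CSV = "email,name\nalice@example.com,Alice\nbob@example.com,Bob\n"

PERSON_DB = {
    row["email"]: row["name"]
    for row in csv.DictReader(io.StringIO(_CSV))
}

def lookup(email):
    """Read-only request handler: workers only ever read PERSON_DB."""
    return email in PERSON_DB
```

The key property, and the one that matters for the rest of this story, is that the workers never write to this data.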

PersonDB architecture setup

A rough architecture of how PersonDB is set up.

Issue: High Memory Usage

The memory usage of PersonDB was around 80%, up from 65% a couple of months earlier. This growth was due to the sheer increase in the number of customers we were protecting.

A graph of PersonDB's high memory usage

77.8% of the memory is used when the service is deployed.

In addition, we saw error rates breaching our four nines SLA. This meant that less than 99.99% of requests to PersonDB were successful. With our optimistic growth projections, this gave us at most two weeks to figure out a way to scale up the service.

Deep Dive and Repeated Learnings

In reality, this wasn’t the first time we had faced high resource usage; we’d had the same problem just a few months prior. The memory usage grew slowly over time, almost like a memory leak. The only thing stopping memory from being exhausted was that we rotate our servers every night to refresh PersonDB’s dataset.

PersonDB memory usage decreasing after a re-deployment

The memory usage grew over time, until a re-deployment was triggered.

We thought that the individual requests sent by customers were causing a memory leak. To investigate this, we tried profiling each request using tracemalloc but couldn’t find any data structures created by the individual requests that could be leaking memory.
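A minimal version of that per-request profiling looks something like the following (the handler here is a hypothetical stand-in for PersonDB’s actual Thrift handler):

```python
import tracemalloc

def handle_request(payload):
    # Hypothetical request handler standing in for PersonDB's real one.
    return {"match": payload in {"alice@example.com"}}

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Simulate a burst of incoming requests.
for _ in range(1000):
    handle_request("alice@example.com")

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Allocation diffs since `before`, grouped by source line. A true leak shows
# up as an entry whose size_diff keeps growing across repeated runs.
stats = after.compare_to(before, "lineno")
for stat in stats[:5]:
    print(stat)
```

In our case, no request-scoped data structure showed a growing `size_diff`, which is what pushed us to look elsewhere.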

That left the data store itself as the source of the leak. This was hard to believe, given that we only ever read from the data store and never write to it. Upon running the “top” command, our team lead Michael noticed that the “CoW” column increased steadily as requests came in. This corresponded to the increase in total memory usage.

A quick recap: Copy-On-Write (CoW) is a mechanism where a forked child process initially shares its parent’s memory pages rather than copying them. Copying a page is deferred until the child writes to that particular page. When this happens, it is called a Copy-On-Write fault.
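The mechanism is easy to see with a bare `os.fork` (a sketch, not PersonDB code; it relies on POSIX fork, so it runs on Linux/macOS only):

```python
import os

data = list(range(100_000))  # lives in the parent's memory pages

pid = os.fork()
if pid == 0:
    # Child: at the OS level, reading shared pages does not copy them...
    total = sum(data)
    # ...but this write forces the kernel to copy the affected page:
    # a Copy-On-Write fault.
    data[0] = -1
    os._exit(0)

os.waitpid(pid, 0)
# The parent's copy is untouched; the child wrote to its own copied page.
print(data[0])  # 0
```

The caveat in the first comment is exactly where CPython will surprise us later: at the Python level, even a “read” is not purely a read.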

Recall that Gunicorn manages a set of workers to handle requests: a central Gunicorn coordinator process forks worker processes that handle incoming requests. We initialized our data store in the coordinator process, and the workers then handled requests independently while reading the PersonDB data inherited from the coordinator.

We decided to first run PersonDB locally with a smaller dataset. We saw the following process tree in the interactive htop command:

PersonDB Gunicorn coordinator forking 4 worker processes

PersonDB’s Gunicorn coordinator forks 4 worker processes. The number of workers is configurable.

And the output from “top” showed that there were no significant Copy-On-Write faults upon initializing PersonDB.

PersonDB copy-on-write faults

In order to simulate our production environment, we built a small load-testing script that fires thousands of requests per second. While blasting the DB with 100,000 requests, we noticed that memory usage shot up until each worker used almost as much memory as the coordinator process.

PersonDB memory usage test

And checking the top output, the number of CoW faults was exceedingly high. This baffled us because each Gunicorn worker never actually wrote to the DB.

Top output CoW faults

Until we could dive deeper, we knew that:

  1. There was clearly something fishy going on: the workers were triggering CoW faults against memory they only ever read from the coordinator process.

  2. The fewer the workers, the fewer the CoW faults.

This helped us reduce CoW fault occurrence by 4x, just by reducing the number of workers from four to one. Since we had reduced the number of workers, we increased the number of PersonDB servers to compensate for the 4x drop in throughput. The trade-off in cost wasn’t much, so it was worth it to get us out of the danger zone.
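The mitigation itself amounts to a one-flag change. A hypothetical invocation (the module path, bind address, and worker class are invented for illustration; only the `--workers` drop reflects the actual change):

```shell
# Before: 4 pre-forked workers, each racking up CoW faults against
# the coordinator's memory.
# gunicorn --workers 4 --worker-class gevent --bind 0.0.0.0:8080 persondb.app:app

# After: a single worker, with the lost throughput recovered by running
# more PersonDB instances behind the load balancer.
gunicorn --workers 1 --worker-class gevent --bind 0.0.0.0:8080 persondb.app:app
```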

The reduction in CoW faults caused a 2.5x reduction in memory usage. While this actually didn’t solve the problem completely, it bought us a few months worth of time. This allowed our team, which was four engineers at the time, to focus on more pressing issues.

Applying our Learnings Again

Fast forward to now, when the same creeping memory usage was appearing. This was, once again, due to our rapid growth. We saw that traffic to PersonDB almost doubled in volume in four months. In addition, the dataset for PersonDB more than doubled in size. This was causing us to hit our resource capacities again.

We knew that the memory usage pattern hinted at the CoW issue that we saw a couple of months ago. We decided that we could invest a little more time into PersonDB and solve the “memory leak” for good. But once again, we were constrained by time. We had to solve this before investing in the next-generation service so that downstream services (and more importantly, our customers) wouldn’t be affected.

When running PersonDB to study the htop and top outputs, we saw that even with a single worker, it was still incurring a steady stream of CoW faults. Once again, Copy-On-Write faults didn’t make sense for a read-only database.

It helps to understand exactly why CoW faults were happening here. After a bit of digging, we found this article by Instagram, where they figured out that Copy-On-Read was happening on their pre-fork server. Copy-On-Read means that memory pages were being copied by the child processes upon merely being read—which is precisely what we were seeing in PersonDB.

Copy-On-Read was occurring because Python keeps a reference count for every object for the sake of garbage collection, and these reference counts are stored within the data structure representing the object itself. In effect:

  • Each object in PersonDB had a reference count, managed by the Python runtime.

  • Each worker process read PersonDB data from the memory pages it inherited from the coordinator, using that data to process incoming requests.

  • Each time an object in those inherited pages was accessed, the Python runtime incremented its reference count.

  • Incrementing the reference count is a write to the memory page that holds the object.

  • Because the worker was writing to pages it shared with the coordinator process, each such write triggered a CoW fault and copied the page.
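The effect behind those bullets is visible from plain Python with `sys.getrefcount`: merely creating another reference to an object mutates the object’s header, and across a fork that mutation is exactly the write that trips CoW.

```python
import sys

obj = ["person-db", "row"]  # any heap object will do

# Note: getrefcount itself holds a temporary reference, so its absolute
# value is off by one; we only care about the delta.
base = sys.getrefcount(obj)

alias = obj  # a pure "read" of obj from the programmer's point of view...
after = sys.getrefcount(obj)

# ...yet the refcount stored inside the object was incremented: a write to
# the memory page holding obj, and hence a CoW fault in a forked worker.
print(after - base)  # 1
```

This is why a strictly read-only workload still "writes" to every page it touches in CPython.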

We, unfortunately, did not have the runway to work around the Python Garbage Collector. In the time that it would take to do a deeper dive into the CoW issue, we could have even made significant progress in re-architecting PersonDB into its next generation. One of the biggest learnings from my time at Abnormal is that it’s worth the time and effort to identify the highest ROI problems before jumping in to solve them.

An Unexpected Solution

To that end, we decided to scrap Gunicorn entirely. In reality, our business logic was built on the Bottle framework. Bottle is decoupled from Gunicorn—Bottle is a web framework that handles the application code, while Gunicorn is the WSGI server that handles raw HTTP requests. Upon initialization, Gunicorn calls the request handlers defined in the Bottle app to handle incoming HTTP requests.

We did a bit of digging and found that switching the server away from Gunicorn was pretty simple. We happened to be using Gevent’s greenlets, which are cooperatively scheduled user threads, within each Gunicorn worker. We opted to use Gevent’s built-in WSGI server to prevent any CoW faults. Since we only had one Gunicorn worker anyway, the coordinator/worker architecture was redundant.
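The shape of the change is small. A sketch (the app, port, and handler here are hypothetical; and since gevent may not be installed everywhere, its import is guarded for illustration):

```python
# Serve the app with gevent's WSGI server instead of Gunicorn: one process,
# one greenlet per connection, no coordinator forking workers.
try:
    from gevent.pywsgi import WSGIServer
    HAVE_GEVENT = True
except ImportError:
    HAVE_GEVENT = False

def app(environ, start_response):
    # Stand-in WSGI callable; in PersonDB this is the Bottle app object.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

if HAVE_GEVENT:
    server = WSGIServer(("127.0.0.1", 8080), app)
    # server.serve_forever()  # blocks; commented out for illustration
```

With no fork, there are no shared pages to fault on: the single process owns all of its refcounts outright.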

Of course, in order to do this properly, we had to set up a staging environment and load testing framework to validate the improvement in memory usage. Changing the backing server from Gunicorn to another WSGI server could pose a risk to our production environment, so it was crucial that we discover any issues with gevent’s WSGI server in a safe environment.

After some thorough testing, we verified that it could handle production traffic and didn’t have the memory creep issue.

A graph showing memory usage decreasing

Unexpected Performance Gains

On top of the memory leak being fixed, we saw far lower error rates. Recall that the errors were breaching our four nines SLA when we used Gunicorn.

Graph showing performance gains and lower error rates

Since Gunicorn follows a pre-fork model, it has to run a fair amount of management code over its workers, and managing processes is not a trivial task. Gunicorn’s server arbiter is built for synchronous code, running a loop where it does an almost busy-wait while managing all the workers—sometimes referred to as a “control loop”. With only one worker under one coordinator, we were paying the price of a coordinator/worker architecture without reaping any of its benefits. In addition, each worker blocks while handling a request, causing new incoming requests to queue in a fixed-length backlog managed by the coordinator.

The Bottle documentation states, “Greenlets behave similar to traditional threads, but are very cheap to create. A gevent-based server can spawn thousands of greenlets (one for each connection) with almost no overhead. Blocking individual greenlets has no impact on the server’s ability to accept new requests. The number of concurrent connections is virtually unlimited.”

Gevent’s WSGIServer builds on top of StreamServer, a generic TCP server, to accept requests and handle them accordingly. There is still a loop accepting incoming requests, but the difference is that greenlets are cheap to create and don’t block one another, allowing the server to accept many more requests than Gunicorn.

In our case, since our handlers perform no blocking I/O, the benefit is that there is no process-management overhead. In addition, there is no explicit queue management by the WSGI server: the queue is implicitly defined by the greenlets spawned (one for each socket).

The lightweight nature of greenlets combined with the larger request backlog translates to requests being handled faster, with fewer requests getting dropped by the server. We’ll take the free win, especially since it saves us from having to invest more precious engineering resources on PersonDB.

Conclusions

Soon enough, we expect to hit resource capacity again and will have to invest in the next generation of PersonDB, which must be able to handle 10x the throughput of the current version. We’re working on the design now; we would likely shard the dataset over multiple PersonDB instances, allowing for faster lookups and lower load per instance. Until we roll that out, the improvements we have made to PersonDB’s server will keep the service performant and reliable enough to power our real-time scoring systems.

In the meantime, some conclusions from this project include:

  • Infrastructure engineering, especially at a hyper-growth startup, means pairing a solid understanding of the underlying technologies with pragmatic decisions that keep pace with the business’ scale.

  • It is very important to understand the root cause of a problem before jumping into building solutions. The simple act of observing and measuring the CoW faults enabled us to act quickly and reduce the resource consumption of PersonDB.

  • Even if we understand the root cause of a problem, opting for a quicker solution and buying a bit more time may be more optimal than tackling the root cause head-on.

  • Be careful when sharing memory across forked processes in Python, as reference counting can make it no better than copying the memory outright.

There are plenty of interesting problems relating to scalability and reliability that we haven’t solved yet. The Platform & Infrastructure team wants to empower engineers in the organization by scaling up all our core systems and enabling engineers to shift their focus to customer problems. If our work sounds exciting to you, we are hiring!

To learn more about our engineering team, check out our Engineering Spotlights or see our current open positions to apply now.
