Saving Memory on a Python Database Server
Previously this year, Dhurv Purushottam, our Data Platform lead, wrote about how we successfully scaled up one of our most crucial, data-intensive services with low engineering cost. He wrote about scaling an in-memory dataset by moving it into its own micro-service, and this has proved to be effective, especially given the constraints of money and (precious) engineering time.
At a hyper-growth startup, a solution from six months ago will unfortunately no longer scale. The business is growing rapidly, and this traffic to this service in particular was growing at an unprecedented rate. We hit a point where it needed re-architecting to support 10x the current scale.
Due to its importance to our critical code-path that protects our customers, we wanted to ensure that the service didn’t hit its resource capacity before the next generation of the service was designed and implemented. We had to buy some time by identifying the high resource consumption of the service.
“First, a few details about this dataset: PersonDB. PersonDB contains data about our customers’ employees—the users that Abnormal protects. With each email that Abnormal processes, the sender and recipient information is matched against PersonDB, and whether or not there’s a match affects how we score that message. The score is a deciding factor in whether we consider the message a threat to the recipient.”
- Outgrowing Your In-Memory DB and Scaling with Microservices.
When we noticed PersonDB’s memory usage almost hitting capacity, it was set up as a microservice, where we run multiple copies of the DB.
Each instance of PersonDB is a Thrift-based Python server. Each of these instances would load up uncompressed CSV data and load them into internal Python data structures, resulting in a base memory usage of around 24GB out of the available 30GB.
PersonDB is powered by Gunicorn, a WSGI HTTP Server. Gunicorn uses a pre-fork worker model, which means that it manages a set of worker threads that handle requests concurrently. Gunicorn allows PersonDB to process thousands of requests per second from real-time applications. Each client talks to a Gunicorn thread, which in turn queries the in-memory PersonDB data.
Issue: High Memory Usage
The memory usage of PersonDB was at around ~80%, up from 65% a couple of months ago. This growth was due to the sheer increase in the number of customers that we are protecting.
In addition, we saw error rates breaching our four nines SLA. This meant that less than 99.99% of requests to PersonDB were successful. With our optimistic growth projections, this gave us at most two weeks to figure out a way to scale up the service.
Deep Dive and Repeated Learnings
In reality, this wasn’t the first time we had faced high resource usage, as we’d had the same problem just a few months prior to this. It seemed like the memory usage was growing slowly over time, almost like a memory leak. The only thing that was stopping all the memory from being used was the fact that we rotate our servers every night to update PersonDB’s dataset.
We thought that the individual requests sent by customers were causing a memory leak. To investigate this, we tried profiling each request using tracemalloc but couldn’t find any data structures created by the individual requests that could be leaking memory.
This meant that it was our data store that was leaking memory. This was hard to believe due to the fact that we only ever read from the data store, and never write to it. Upon running the “top” command, our team lead Michael noticed that the “CoW” column increased steadily as requests came in. This corresponded to the increase in total memory usage.
A quick recap: Copy-On-Write (CoW) is a mechanism where forked processes/threads can lazily copy all of their parents’ memory. This means that copying a memory page is deferred until the child process/thread needs to write to that particular memory page. When this happens, it is called a Copy-On-Write fault.
Recall that Gunicorn manages a set of worker threads to handle requests. This means that there is a central Gunicorn process that spawns worker threads to handle incoming requests. We initialized our data store in the main process that coordinated the workers. Then, the workers handled requests independently while reading PersonDB data from the coordinator process.
We decided to first run PersonDB locally with a smaller dataset. We see the following tree in the interactive
And the output from “top” showed that there were no significant Copy-On-Write faults upon initializing PersonDB.
In order to simulate our production environment, we built a small load-testing script that fires thousands of requests per second. While blasting the DB with 100,000 requests, we noticed the memory usage shot up such that each worker used almost as much memory as the coordinator thread.
And checking the
top output, the number of CoW faults was exceedingly high. This baffled us because each gunicorn worker didn’t actually write to the DB.
Until we could dive deeper, we knew that:
There was clearly something fishy going on with the workers having CoW from the coordinator process.
The fewer the workers, the lower the amount of CoW.
This helped us reduce CoW fault occurrence by 4x, just by reducing the number of workers from four to one. Since we had reduced the number of workers, we increased the number of PersonDB servers to compensate for the 4x drop in throughput. The trade-off in cost wasn’t much, so it was worth it to get us out of the danger zone.
The reduction in CoW faults caused a 2.5x reduction in memory usage. While this actually didn’t solve the problem completely, it bought us a few months worth of time. This allowed our team, which was four engineers at the time, to focus on more pressing issues.
Applying our Learnings Again
Fast forward to now, when the same creeping memory usage was appearing. This was, once again, due to our rapid growth. We saw that traffic to PersonDB almost doubled in volume in four months. In addition, the dataset for PersonDB more than doubled in size. This was causing us to hit our resource capacities again.
We knew that the memory usage pattern hinted at the CoW issue that we saw a couple of months ago. We decided that we could invest a little more time into PersonDB and solve the “memory leak” for good. But once again, we were constrained by time. We had to solve this before investing in the next-generation service so that downstream services (and more importantly, our customers) wouldn’t be affected.
When running PersonDB to study the
top outputs, we saw that even though there was one worker, it was indeed seeing a bunch of CoW faults. Once again, Copy-On-Write faults didn’t make sense for a read-only database.
It helps to understand exactly why CoW faults were happening here. After a bit of digging, we found this article by Instagram, where they figured out that Copy-On-Read was happening on their pre-fork server. Copy-On-Read means that the memory was being copied by the child processes/threads upon being accessed—which is precisely what we were seeing in PersonDB.
Copy-On-Read was occuring because Python keeps a reference count for every object for the sake of garbage collection. These reference counts are located within the data structure representing the object. In effect:
Each object in PersonDB had a reference count, managed by the Python runtime.
Each worker thread was accessing PersonDB data in the coordinator thread. They used this data to process incoming requests.
Each time a memory page on the coordinator thread was accessed, the Python runtime incremented those reference counts.
Incrementing the reference count means that the variable has to get overwritten.
Because the worker thread was writing to those reference counts in the coordinator process’ memory, a CoW fault gets triggered.
We, unfortunately, did not have the runway to work around the Python Garbage Collector. In the time that it would take to do a deeper dive into the CoW issue, we could have even made significant progress in re-architecting PersonDB into its next generation. One of the biggest learnings from my time at Abnormal is that it’s worth the time and effort to identify the highest ROI problems before jumping in to solve them.
An Unexpected Solution
To that end, we decided to scrap Gunicorn entirely. In reality, our business logic was built on the Bottle framework. Bottle is decoupled from Gunicorn—Bottle is a web framework that handles the application code, while Gunicorn is the WSGI server that handles raw HTTP requests. Upon getting initialized by Bottle, Gunicorn calls request handlers defined in the Bottle app to handle incoming HTTP requests.
We did a bit of digging and found out that switching the server away from Gunicorn was pretty simple. We happened to be using Gevent’s greenlets, which are cooperatively scheduled user threads, for each Gunicorn worker. We opted to use Gevent’s built-in WSGI server to prevent any CoW faults. Since we only had one worker for Gunicorn a coordinator/worker architecture was redundant.
Of course, in order to do this properly, we had to set up a staging environment and load testing framework to validate the improvement in memory usage. Changing the backing server from Gunicorn to another WSGI server could pose a risk to our production environment, so it was crucial that we discover any issues with gevent’s WSGI server in a safe environment.
After some thorough testing, we verified that it could handle production traffic and didn’t have the memory creep issue.
Unexpected Performance Gains
We saw that on top of the memory leak being fixed, we were seeing far lower error rates. Recall that the errors were breaching our four nines SLA when we used Gunicorn.
Since Gunicorn follows a pre-fork model, it has to run a bunch of management code over its workers. Managing threads is not a trivial task. Their server arbiter is built for synchronous code, running a loop where it does an almost busy-wait and manages all the workers, which is sometimes referred to as a “control loop”. Since there is only one worker for one coordinator, we were paying the price of having a coordinator/worker architecture without reaping the benefits of it. In addition, each worker thread blocks while it handles the requests, causing new incoming requests to be stored in a fixed-length backlog by the coordinator.
Bottle states, “Greenlets behave similar to traditional threads, but are very cheap to create. A gevent-based server can spawn thousands of greenlets (one for each connection) with almost no overhead. Blocking individual greenlets has no impact on the servers ability to accept new requests. The number of concurrent connections is virtually unlimited.”
Gevent’s WSGIHandler builds on top of StreamServer, a generic TCP server, to accept requests and handle them accordingly. Even though we always have a while-loop to handle incoming requests, the difference is that greenlets are non-blocking and easy to create, allowing the server to accept many more requests than Gunicorn.
In our case, since we don’t do I/O operations, the benefit is that there is no overhead of thread management. In addition, there is no explicit queue management by the WSGI server. The queue is implicitly defined by the greenlets spawned (one for each socket).
The lightweight nature of greenlets combined with the larger request backlog translates to requests being handled faster, with fewer requests getting dropped by the server. We’ll take the free win, especially since it saves us from having to invest more precious engineering resources on PersonDB.
Soon enough, we expect to hit resource capacity again and would have to invest in the next generation of PersonDB, which must be able to handle 10x the throughput of the current version. We’re working on the design now, but we would probably shard the dataset over multiple PersonDB instances, allowing for faster lookups and lower load per instance. Until we roll that out, the improvements we have made to PersonDB’s server will allow us to keep the service performant and reliable enough to power our real-time scoring systems.
In the meantime, some conclusions from this project include:
Infrastructure engineering, especially in a hyper-growth startup, involves harnessing our understanding of technologies as well as making optimal decisions to handle the business’ scale.
It is very important to understand the root cause of a problem before jumping into building solutions. The simple act of observing and measuring the CoW faults enabled us to act quickly and reduce the resource consumption of PersonDB.
Even if we understand the root cause of a problem, opting for a quicker solution and buying a bit more time may be more optimal than tackling the root cause head-on.
Be careful when sharing memory across threads in Python, as it could be no better than copying the memory in the first place.
There are plenty of interesting problems relating to scalability and reliability that we haven’t solved yet. The Platform & Infrastructure team wants to empower engineers in the organization by scaling up all our core systems and enabling engineers to shift their focus to customer problems. If our work sounds exciting to you, we are hiring!