Rearchitecting a System: Performing a Migration on a Production System With Zero Downtime
“This will be easy, they said.”
Any engineer with experience in migrating from one system to another will tell you the process is anything but straightforward.
In a recent blog post, we shared how rapid growth in our customer base had led to an exponential increase in the email volume being processed by our Unwanted Mail service. We explained the challenges of scaling up to accommodate this growth as well as the shortcomings of the existing Python and Celery architecture for increasing our traffic capacity by a hundredfold.
We also provided a detailed analysis of the engineering problems we were facing, and explained how we determined that a system built on Golang and Kafka would provide a much more reliable platform that performs better at scale.
In this second post, we’ll take a deep dive into how we migrated traffic from the legacy system and ensured a smooth rollout with no customer downtime. We will also take some time to quantify the improvements this migration brought to our product.
The Game Plan
“Failing to plan is planning to fail.”
With a new architecture planned, we set out to isolate the portions that needed to be rewritten and to plan a smooth rollout process that would minimize (or eliminate) downtime. A proper plan was required because feature requests were still coming in and resources were limited. Since the main cost of a rewrite was development time, we decided to bite the bullet and dive in.
Migrating From One Solution to the Next
No rewrite is perfect, and expecting one to be is unrealistic. We knew we needed a good migration process in place to ensure each key component was thoroughly tested and then rolled out with minimal downtime.
Below, we share a high-level overview of our five-stage rollout process, which we hope will provide a reference point for eager learners.
Migration Stage 1: Mirroring Volume
To validate our Kafka setup, we first mirrored the volume from the existing architecture onto Kafka and confirmed that we had provisioned infrastructure limits correctly. We then used this opportunity to address configuration concerns and data issues.
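As a rough illustration of this kind of volume mirroring, here is a minimal sketch in Go. It is not our production code: the `Publisher` interface is a hypothetical stand-in for a Kafka producer client, and `Mirror`, `memPublisher`, and the topic name are made up for the example. The point it shows is that the copy sent to the new pipeline is best-effort, so a failure on the mirror side never fails the live request.

```go
package main

import (
	"errors"
	"fmt"
)

// Publisher is a hypothetical stand-in for a Kafka producer client.
type Publisher interface {
	Publish(topic string, payload []byte) error
}

// Mirror sends a best-effort copy of every notification to the new
// pipeline and then hands the original to the legacy handler.
type Mirror struct {
	Legacy   func(payload []byte) error
	Shadow   Publisher
	Topic    string
	Mirrored int // copies accepted by the new pipeline
	Dropped  int // copies the new pipeline rejected
}

// Handle returns only the legacy handler's result. Mirror errors are
// counted, never propagated. Not safe for concurrent use as written.
func (m *Mirror) Handle(payload []byte) error {
	// Copy first so the mirror sees the payload before the legacy
	// handler has a chance to modify it.
	dup := append([]byte(nil), payload...)
	if err := m.Shadow.Publish(m.Topic, dup); err != nil {
		m.Dropped++
	} else {
		m.Mirrored++
	}
	return m.Legacy(payload)
}

// memPublisher is an in-memory Publisher with a fixed capacity, used
// here to imitate an under-provisioned topic.
type memPublisher struct {
	capacity int
	msgs     [][]byte
}

func (p *memPublisher) Publish(topic string, payload []byte) error {
	if len(p.msgs) >= p.capacity {
		return errors.New("mirror topic is full")
	}
	p.msgs = append(p.msgs, payload)
	return nil
}

func main() {
	pub := &memPublisher{capacity: 3}
	m := &Mirror{
		Legacy: func(payload []byte) error { return nil },
		Shadow: pub,
		Topic:  "unwanted-mail-mirror", // hypothetical topic name
	}
	for i := 0; i < 5; i++ {
		_ = m.Handle([]byte(fmt.Sprintf("notification-%d", i)))
	}
	fmt.Printf("mirrored=%d dropped=%d\n", m.Mirrored, m.Dropped)
}
```

Watching a counter like `Dropped` under real traffic is one way to spot limits that were provisioned too low before any customer depends on the new path.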
Migration Stage 2: Mirroring Requests
Next, we mirrored requests onto a staging environment running our Golang worker implementation. This allowed us to ensure that the services could be built properly and also gave us a safe environment to reproduce bugs and exercise use cases extensively.
Migration Stage 3: Comparing Metrics for Feature Parity
Once the previous two stages were complete, we mirrored these requests onto a production instance using the new architecture, disabled actions that mutate data, and recorded metrics. By comparing these metrics with our current service we could make an assessment about feature parity.
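The comparison step can be sketched roughly as follows. This is an illustrative example under our own assumptions, not the actual service code: `Processor`, `Parity`, and `fromTable` are hypothetical names, and the `dryRun` flag stands in for "actions that mutate data are disabled". The legacy path keeps acting on the message while the new path only computes a verdict that is recorded for comparison.

```go
package main

import "fmt"

// Verdict is the outcome a pipeline assigns to a message.
type Verdict string

// Processor is a hypothetical interface both pipelines satisfy.
// When dryRun is true the processor must not mutate any data.
type Processor func(msgID string, dryRun bool) Verdict

// Parity counts how often the two pipelines agree.
type Parity struct {
	Total    int
	Mismatch int
}

// Compare runs the legacy path for real and the candidate in dry-run
// mode, records whether they agree, and returns the legacy verdict.
func (p *Parity) Compare(msgID string, legacy, candidate Processor) Verdict {
	live := legacy(msgID, false)
	shadow := candidate(msgID, true)
	p.Total++
	if live != shadow {
		p.Mismatch++
	}
	return live
}

// Rate is the fraction of messages on which both pipelines agreed.
func (p *Parity) Rate() float64 {
	if p.Total == 0 {
		return 0
	}
	return float64(p.Total-p.Mismatch) / float64(p.Total)
}

// fromTable builds a toy Processor from a lookup table. The counter
// stands in for a real mutation such as moving a message.
func fromTable(table map[string]Verdict, mutations *int) Processor {
	return func(msgID string, dryRun bool) Verdict {
		if !dryRun {
			*mutations++
		}
		return table[msgID]
	}
}

func main() {
	var legacyWrites, candidateWrites int
	legacy := fromTable(map[string]Verdict{"a": "unwanted", "b": "safe"}, &legacyWrites)
	candidate := fromTable(map[string]Verdict{"a": "unwanted", "b": "unwanted"}, &candidateWrites)

	p := &Parity{}
	for _, id := range []string{"a", "b"} {
		p.Compare(id, legacy, candidate)
	}
	fmt.Printf("parity=%.2f legacyWrites=%d candidateWrites=%d\n",
		p.Rate(), legacyWrites, candidateWrites)
}
```

In practice the mismatches are the interesting part: each one is either a bug in the new implementation or a behavior of the old one that was never written down.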
Migration Stage 4: Gradual Customer Migration
Once we had reasonable confidence that the new architecture was stable, we started gradually migrating customers over to the new architecture using feature flags. This process was spread out over a few days, with a small percentage being migrated each time and closely monitored.
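One common way to implement a percentage-based rollout like this is to hash each customer ID into a fixed bucket, so that a customer who has been migrated stays migrated as the percentage grows. The sketch below assumes that approach for illustration only; the post does not say which feature flag system was used, and `inRollout` and the customer IDs are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// inRollout reports whether a customer falls inside the first
// `percent` of 100 stable buckets. The bucket depends only on the
// customer ID, so raising percent never moves a customer back.
func inRollout(customerID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(customerID)) // hash.Hash writes never return an error
	return h.Sum32()%100 < percent
}

func main() {
	customers := []string{"acme", "globex", "initech", "umbrella"}
	for _, percent := range []uint32{10, 50, 100} {
		migrated := 0
		for _, c := range customers {
			if inRollout(c, percent) {
				migrated++
			}
		}
		fmt.Printf("%d%% flag: %d of %d customers on the new architecture\n",
			percent, migrated, len(customers))
	}
}
```

Because the assignment is deterministic, any metrics anomaly seen after a step can be traced back to the specific set of customers added in that step.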
Migration Stage 5: Complete Migration and Deprecate Old Flow
Finally, using a global feature flag, we completed the migration process to the new architecture. We stopped notifications from flowing toward the old service so we could mark it for deprecation.
One thing to note is that at each migration stage, key components were feature flagged so that we could make the switch back at any time to the old stable architecture without requiring a code push or service deployment. We also had a replay mechanism in place that allowed us to replay traffic across a time window (on either system) in the event that either service went down during the migration process.
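To make the switch-back and replay ideas concrete, here is a small sketch under stated assumptions. `Router`, `Event`, and `minute` are hypothetical names, the flag is a plain in-process boolean rather than a real feature flag service, and the event log is an in-memory slice where a real system would more likely rely on a durable log. It shows routing that can be flipped at runtime without a deployment, plus replaying a time window of traffic to either system.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Event is a notification with the time it was received.
type Event struct {
	At      time.Time
	Payload string
}

// Router sends each event to the old or new system based on a flag
// that can be flipped at runtime, and keeps a log for replays.
// The log is not safe for concurrent Dispatch calls as written.
type Router struct {
	useNew   atomic.Bool
	log      []Event
	Old, New func(Event)
}

// SetUseNew flips routing without a code push or redeploy.
func (r *Router) SetUseNew(on bool) { r.useNew.Store(on) }

// Dispatch records the event and routes it according to the flag.
func (r *Router) Dispatch(e Event) {
	r.log = append(r.log, e)
	if r.useNew.Load() {
		r.New(e)
	} else {
		r.Old(e)
	}
}

// Replay re-sends every logged event with from <= At < to to the
// given target and returns how many events were replayed.
func (r *Router) Replay(from, to time.Time, target func(Event)) int {
	n := 0
	for _, e := range r.log {
		if !e.At.Before(from) && e.At.Before(to) {
			target(e)
			n++
		}
	}
	return n
}

// minute returns a fixed reference time plus n minutes.
func minute(n int) time.Time {
	base := time.Date(2022, 1, 1, 0, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(n) * time.Minute)
}

func main() {
	var oldSeen, newSeen int
	r := &Router{
		Old: func(Event) { oldSeen++ },
		New: func(Event) { newSeen++ },
	}
	r.Dispatch(Event{At: minute(0), Payload: "n0"})
	r.SetUseNew(true)
	r.Dispatch(Event{At: minute(1), Payload: "n1"})
	r.SetUseNew(false) // roll back with no deployment
	replayed := r.Replay(minute(1), minute(2), r.Old)
	fmt.Printf("old=%d new=%d replayed=%d\n", oldSeen, newSeen, replayed)
}
```

A replay target generally needs to be idempotent, since a window chosen after an incident will often overlap with events that were already processed.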
The five-stage process and the above factors all combined to help us smoothly migrate this highly utilized service without any downtime!
“Mom, I made it!”
With the completion of this project and a smooth rollout, we instantly started reaping the benefits of our shiny new architecture:
- Deployment time was cut by more than half as we now had a smaller build image, fewer tasks to deploy, and faster build times thanks to Golang.
- Processing capacity increased by 100x compared to the previous architecture. This let us remove the business logic that had filtered requests in an earlier step and comfortably increase processing by 20x. Processing volume jumped from 200 qps to 1000 qps without breaking a sweat.
- The P99 time a request spent waiting in the queue dropped from minutes (and at times hours) to less than one minute.
- Load and latency on the upstream core system went down. Message processing time was reduced by an average of 16%.
- Costs stayed the same despite increased processing volume.
- No more outages.
Lessons and Key Takeaways
“Learn from the mistakes of those who have walked before you.”
We learned a ton of lessons from this self-initiated project, which benefited not just the company but also us as engineers. Here are the key takeaways:
Understand your limitations.
- Depending on your product requirements, optimize for either delivery velocity or scale. Additionally, know your limitations and understand when you might need a redesign. A better understanding of system limitations will help you decide between a refactor and a rewrite.
Don’t be afraid to throw away code.
- While designs often look good on paper, you can only know for sure what will work after implementing and testing out a PoC. Do not leave things to chance.
Know your infrastructure.
- True costs come from badly managed infrastructure. Know your limits and your usage. Could the architecture be revamped to use resources more efficiently? What is the growth runway of your system, and will you be in dire need of a refactor soon?
We would once more like to thank Praveen Bathala for giving us advice and guidance for this project, without which this would not have been possible.
As a fast-growing company, we’ve got lots of interesting engineering challenges to solve—not just the quick way, but the right way. If these problems interest you, and you would like to further your growth as an engineer, we’re hiring! Learn more at our careers website.