Rearchitecting a System: Performing a Migration on a Production System With Zero Downtime

We recently shared a look at how the Abnormal engineering team overhauled our Unwanted Mail service architecture to accommodate our rapid growth. Today, we’re diving into how the team migrated traffic to the new architecture—with zero downtime.
September 12, 2022

“This will be easy, they said.”

Any engineer with experience in migrating from one system to another will tell you the process is anything but straightforward.

In a recent blog post, we shared how rapid growth in our customer base had led to an exponential increase in the email volume processed by our Unwanted Mail service. We explained the challenges of scaling up to accommodate this growth, as well as the shortcomings of the existing Python and Celery architecture when it came to increasing our traffic capacity a hundredfold.

We also provided a detailed analysis of the engineering problems we were facing, and explained how we determined that a system built on Golang and Kafka would provide a much more reliable platform and perform better at scale.

In this second post, we’ll take a deep dive into how we performed the migration of traffic from the legacy system and ensured a smooth rollout process with no customer downtime. We will also take some time to quantify the improvements this migration had on our product.

The Game Plan

“Failing to plan is planning to fail.”

With a new architecture planned, we set out to isolate the portions that needed to be rewritten and to plan a smooth rollout process that would minimize (or eliminate) downtime. A proper plan was required because there were still ongoing feature requests and limited resources available. Given that the cons were mostly in development time, we decided to bite the bullet and dive into a rewrite.

Migrating From One Solution to the Next

No rewrite is perfect and expecting a perfect rewrite is unrealistic. We knew we needed a good migration process in place to ensure each key component was thoroughly tested and then rolled out with minimal downtime.

In the graphics below we share a high-level overview of our five-stage rollout process, which we hope will provide a reference point for eager learners.

Migration Stage 1: Mirroring Volume

[Figure: Stage 1, Mirroring Volume]

To verify our Kafka setup, we first mirrored the volume from the existing architecture onto Kafka, which let us confirm that we had provisioned infrastructure limits correctly. We also used this opportunity to address configuration concerns and data issues.
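As a rough illustration of volume mirroring, the sketch below wraps a primary publisher so that every message is also copied to a mirror. The `Publisher` interface, `MirroringPublisher`, and `memoryPublisher` are hypothetical names introduced for this example and do not come from any real Kafka or Celery client. The mirror is best effort: its failures are counted but never surface on the production path.

```go
package main

import "errors"

// Publisher is a hypothetical interface standing in for either the legacy
// queue client or a Kafka producer. It is not a real library API.
type Publisher interface {
	Publish(msg []byte) error
}

// MirroringPublisher sends every message to Primary and then makes a
// best-effort copy to Mirror. Mirror failures are counted but never
// returned, so the production path is unaffected by the new system.
// This sketch is synchronous and not safe for concurrent use.
type MirroringPublisher struct {
	Primary      Publisher
	Mirror       Publisher
	MirrorErrors int
}

// Publish returns only the primary's error. If the primary fails, the
// message is not mirrored, so the mirror never sees more than production.
func (m *MirroringPublisher) Publish(msg []byte) error {
	if err := m.Primary.Publish(msg); err != nil {
		return err
	}
	if m.Mirror != nil {
		if err := m.Mirror.Publish(msg); err != nil {
			m.MirrorErrors++
		}
	}
	return nil
}

// memoryPublisher is an in-memory stand-in used only for illustration.
type memoryPublisher struct {
	msgs [][]byte
	fail bool
}

func (p *memoryPublisher) Publish(msg []byte) error {
	if p.fail {
		return errors.New("publish failed")
	}
	p.msgs = append(p.msgs, msg)
	return nil
}
```

A production version would more likely mirror asynchronously so that a slow mirror cannot add latency, but the isolation property is the same: the caller only ever observes the primary's result.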

Migration Stage 2: Mirroring Requests

[Figure: Stage 2, Mirroring Requests]

Next, we mirrored requests onto a staging environment running our Golang worker implementation. This allowed us to confirm that the services could be built properly and gave us a safe environment to hunt for bugs and test use cases extensively.

Migration Stage 3: Comparing Metrics for Feature Parity

[Figure: Stage 3, Comparing Metrics]

Once the previous two stages were complete, we mirrored these requests onto a production instance using the new architecture, disabled actions that mutate data, and recorded metrics. By comparing these metrics with our current service we could make an assessment about feature parity.
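To make the idea of a non-mutating production mirror more concrete, here is a minimal sketch. It assumes each worker produces a per-message verdict that can be compared between systems. The `Worker`, `Verdict`, `Compare`, and `ParityReport` names are hypothetical and chosen for illustration; the post does not say which metrics were actually compared.

```go
package main

import "sort"

// Verdict is a hypothetical outcome of processing one message.
type Verdict string

// Worker classifies messages and optionally applies a side effect.
// When DryRun is true the verdict is recorded but Mutate is never called,
// which is how the shadow system avoids touching customer data.
type Worker struct {
	Classify func(messageID string) Verdict
	Mutate   func(messageID string, v Verdict)
	DryRun   bool
	Verdicts map[string]Verdict
}

// Handle processes one message and records the verdict for later comparison.
func (w *Worker) Handle(messageID string) {
	v := w.Classify(messageID)
	if w.Verdicts == nil {
		w.Verdicts = make(map[string]Verdict)
	}
	w.Verdicts[messageID] = v
	if !w.DryRun && w.Mutate != nil {
		w.Mutate(messageID, v)
	}
}

// ParityReport summarizes how often the two systems agreed.
type ParityReport struct {
	Compared   int
	Mismatches []string // message IDs with differing verdicts, sorted
}

// Compare checks every message seen by both systems. Messages the shadow
// system has not processed yet are skipped rather than counted as failures.
func Compare(legacy, shadow map[string]Verdict) ParityReport {
	var r ParityReport
	for id, want := range legacy {
		got, ok := shadow[id]
		if !ok {
			continue
		}
		r.Compared++
		if got != want {
			r.Mismatches = append(r.Mismatches, id)
		}
	}
	sort.Strings(r.Mismatches)
	return r
}

// MatchRate returns the fraction of compared messages with equal verdicts.
func (r ParityReport) MatchRate() float64 {
	if r.Compared == 0 {
		return 0
	}
	return float64(r.Compared-len(r.Mismatches)) / float64(r.Compared)
}
```

The list of mismatching IDs is often more useful than the aggregate rate, since each one is a concrete message that can be replayed and debugged before any customer is switched over.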

Migration Stage 4: Gradual Customer Migration

[Figure: Stage 4, Gradual Customer Migration]

Once we had reasonable confidence that the new architecture was stable, we started gradually migrating customers over to it using feature flags. This process was spread out over a few days, with a small percentage of customers migrated and closely monitored at each step.
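One common way to implement a percentage-based rollout is to hash each customer ID into a stable bucket and compare it to the current rollout percentage. The sketch below assumes that approach; the post does not describe how the feature flags were actually evaluated, and `bucket` and `UseNewArchitecture` are hypothetical names.

```go
package main

import "hash/fnv"

// bucket maps a customer ID to a stable value in [0, 100). The same ID
// always lands in the same bucket, so a customer does not flip between
// architectures from one request to the next.
func bucket(customerID string) uint32 {
	h := fnv.New32a()
	_, _ = h.Write([]byte(customerID)) // hash.Hash writes never return an error
	return h.Sum32() % 100
}

// UseNewArchitecture reports whether a customer falls inside the current
// rollout percentage. Raising percent only ever adds customers, and
// setting it to 0 sends everyone back to the legacy path.
func UseNewArchitecture(customerID string, percent uint32) bool {
	return bucket(customerID) < percent
}
```

Because the bucket is derived from the ID rather than chosen at random per request, increasing the percentage from, say, 5 to 20 keeps the original 5 percent on the new path, which makes the monitoring data from earlier steps still meaningful.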

Migration Stage 5: Complete Migration and Deprecate Old Flow

[Figure: Stage 5, Complete Migration]

Finally, using a global feature flag, we completed the migration process to the new architecture. We stopped notifications from flowing toward the old service so we could mark it for deprecation.

One thing to note is that at each migration stage, key components were feature flagged so that we could switch back to the old, stable architecture at any time without requiring a code push or service deployment. We also had a replay mechanism in place that allowed us to replay traffic across a time window (on either system) in the event that either service went down during the migration process.
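The two safety nets described here, a runtime switch and a replay over a time window, can be sketched as follows. This is only an illustration under assumptions: `Event`, `Router`, and `Replay` are hypothetical names, the event log is an in-memory slice, and timestamps are Unix seconds. In a real deployment the flag would be read from a flag service and the log from durable storage. The code uses `atomic.Bool`, which requires Go 1.19 or later.

```go
package main

import "sync/atomic"

// Event is a hypothetical record of one incoming notification.
type Event struct {
	ID             string
	ReceivedAtUnix int64 // arrival time in Unix seconds
}

// Handler processes one event.
type Handler func(Event)

// Router sends each event to the new or the legacy handler depending on
// a flag that can be flipped at runtime, with no redeploy needed.
// Always use a *Router, since the atomic flag must not be copied.
type Router struct {
	useNew atomic.Bool
	Legacy Handler
	New    Handler
}

// SetUseNew flips the routing flag. It is safe to call concurrently
// with Route.
func (r *Router) SetUseNew(on bool) { r.useNew.Store(on) }

// Route dispatches the event to whichever system is currently active.
func (r *Router) Route(e Event) {
	if r.useNew.Load() {
		r.New(e)
		return
	}
	r.Legacy(e)
}

// Replay re-sends every logged event whose timestamp falls in the
// half-open window [from, to) through h and returns how many were sent.
func Replay(log []Event, from, to int64, h Handler) int {
	n := 0
	for _, e := range log {
		if e.ReceivedAtUnix >= from && e.ReceivedAtUnix < to {
			h(e)
			n++
		}
	}
	return n
}
```

Passing `router.Route` as the handler to `Replay` sends the missed window to whichever system is active after a rollback. Replaying is only safe if handlers are idempotent, since some events in the window may already have been processed.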

The five-stage process and the above factors all combined to help us smoothly migrate this highly utilized service without any downtime!

The Benefits

“Mom, I made it!”

With the completion of this project and a smooth rollout, we instantly started reaping the benefits of our shiny new architecture:

  • Deployment time was cut by more than half as we now had a smaller build image, fewer tasks to deploy, and faster build times thanks to Golang.
  • Processing capacity increased by 100x compared to the previous architecture. This allowed us to remove the business logic that filtered requests in an earlier step and comfortably increase processing by 20x. Processing volume jumped from 200 qps to 1000 qps without breaking a sweat.
  • The P99 dwell time of a request in the queue dropped from minutes (and at times hours) to less than one minute.
  • Load and latency on the upstream core system went down. Message processing time was reduced by an average of 16%.
  • Costs stayed the same despite increased processing volume.
  • No more outages.

Lessons and Key Takeaways

“Learn from the mistakes of those who have walked before you.”

We learned a ton of lessons from this self-initiated project, which benefited not just the company but also us as engineers. Here are the key takeaways:

Understand your limitations.

  • Depending on your product requirements, optimize for either velocity of delivery or scale. Additionally, know your limitations and understand when you might need a redesign. A better understanding of system limitations will help you decide between a refactor and a rewrite.

Don’t be afraid to throw away code.

  • While designs often look good on paper, you can only know for sure what will work after implementing and testing out a PoC. Do not leave things to chance.

Know your infrastructure.

  • True costs come from badly managed infrastructure. Know your limits and your usages. Could the architecture be revamped to use resources more efficiently? What is the growth runway of your system, and will you be in dire need of a refactor soon?

We would once more like to thank Praveen Bathala for giving us advice and guidance for this project, without which this would not have been possible.

As a fast-growing company, we’ve got lots of interesting engineering challenges to solve—not just the quick way, but the right way. If these problems interest you, and you would like to further your growth as an engineer, we’re hiring! Learn more at our careers website.
