Rearchitecting a System: Performing a Migration on a Production System With Zero Downtime

We recently shared a look at how the Abnormal engineering team overhauled our Unwanted Mail service architecture to accommodate our rapid growth. Today, we’re diving into how the team migrated traffic to the new architecture—with zero downtime.

De Sheng Chuan

Yang Ang

September 12, 2022

“This will be easy, they said.”

Any engineer with experience in migrating from one system to another will tell you the process is anything but straightforward.

In a recent blog post, we shared how rapid growth in our customer base had led to an exponential increase in the email volume being processed by our Unwanted Mail service. We explained the challenges of scaling up to accommodate this growth as well as the shortcomings of the existing Python and Celery architecture for increasing our traffic capacity by a hundredfold.

We also provided a detailed analysis of the engineering problems we were facing, and explained how we determined that a system built on Golang and Kafka would provide a much more reliable platform that performs better at scale.

In this second post, we’ll take a deep dive into how we performed the migration of traffic from the legacy system and ensured a smooth rollout process with no customer downtime. We will also take some time to quantify the improvements this migration had on our product.

The Game Plan

“Failing to plan is planning to fail.”

With a new architecture planned, we set out to isolate the portions that needed to be rewritten and to plan a smooth rollout process that would minimize (or eliminate) downtime. A proper plan was required because there were still ongoing feature requests and limited resources available. Given that the main cost of a rewrite was development time, we decided to bite the bullet and dive in.

Migrating From One Solution to the Next

No rewrite is perfect and expecting a perfect rewrite is unrealistic. We knew we needed a good migration process in place to ensure each key component was thoroughly tested and then rolled out with minimal downtime.

In the graphics below we share a high-level overview of our five-stage rollout process, which we hope will provide a reference point for eager learners.

Migration Stage 1: Mirroring Volume

[Figure: Stage 1, mirroring volume]

To validate our Kafka setup, we first mirrored the volume from the existing architecture onto Kafka so we could verify that infrastructure limits were provisioned correctly. We also used this opportunity to address configuration concerns and data issues.
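As a rough illustration of the mirroring idea (a sketch under stated assumptions, not the production code), the Go snippet below sends every message to the legacy queue and, when a flag is on, also copies it to the new one. The `Publisher` interface, `MirroringPublisher`, and `memoryPublisher` are hypothetical names invented for this example; a real setup would put a Kafka producer behind the interface.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// Publisher is a hypothetical abstraction over a queue producer
// (for example Celery on the legacy side, Kafka on the new side).
type Publisher interface {
	Publish(msg []byte) error
}

// MirroringPublisher always publishes to Legacy and, while Enabled is set,
// also copies the message to Mirror. Use it through a pointer, since it
// contains atomic fields.
type MirroringPublisher struct {
	Legacy       Publisher
	Mirror       Publisher
	Enabled      atomic.Bool  // assumed to be driven by a feature flag
	MirrorErrors atomic.Int64 // counts failed mirror writes for monitoring
}

// Publish fails only if the legacy path fails. Mirror failures are counted
// but never surface to the caller, so the mirror cannot hurt production.
func (m *MirroringPublisher) Publish(msg []byte) error {
	if err := m.Legacy.Publish(msg); err != nil {
		return err
	}
	if m.Enabled.Load() {
		if err := m.Mirror.Publish(msg); err != nil {
			m.MirrorErrors.Add(1)
		}
	}
	return nil
}

// memoryPublisher is an in-memory stand-in used only for illustration.
type memoryPublisher struct {
	msgs [][]byte
	fail bool
}

func (p *memoryPublisher) Publish(msg []byte) error {
	if p.fail {
		return errors.New("publish failed")
	}
	p.msgs = append(p.msgs, msg)
	return nil
}
```

The design choice worth noting is that the mirror path is strictly best-effort: provisioning mistakes on the new side show up in the error counter rather than as customer-facing failures.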

Migration Stage 2: Mirroring Requests

[Figure: Stage 2, mirroring requests]

Next, we mirrored requests onto a staging environment running our Golang worker implementation. This allowed us to ensure that the services could be built properly and also gave us a safe environment to reproduce bugs and exercise use cases extensively.

Migration Stage 3: Comparing Metrics for Feature Parity

[Figure: Stage 3, comparing metrics]

Once the previous two stages were complete, we mirrored these requests onto a production instance using the new architecture, disabled actions that mutate data, and recorded metrics. By comparing these metrics with those of our current service, we could assess feature parity.
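One way to picture this stage is a worker that runs the full decision logic but skips side effects, plus a helper that compares per-outcome counters from both systems. The sketch below is only an assumption about how such a check could look: `Worker`, `Classify`, `Mutate`, and `MaxRelativeDiff` are hypothetical names, and the post does not say which metrics were actually compared.

```go
package main

import "math"

// Worker is a hypothetical message processor. Classify and Mutate stand in
// for real business logic and are assumptions made for this sketch.
// It is not safe for concurrent use.
type Worker struct {
	DryRun   bool                            // when true, mutating actions are skipped
	Classify func(msg string) string         // decides the outcome for a message
	Mutate   func(msg, outcome string) error // applies the data-changing action
	Counts   map[string]int                  // per-outcome metric counters
}

// NewWorker returns a Worker with an initialized counter map.
func NewWorker(dryRun bool, classify func(string) string, mutate func(string, string) error) *Worker {
	return &Worker{DryRun: dryRun, Classify: classify, Mutate: mutate, Counts: make(map[string]int)}
}

// Process records the outcome and applies the mutation unless DryRun is set.
func (w *Worker) Process(msg string) error {
	outcome := w.Classify(msg)
	w.Counts[outcome]++
	if w.DryRun {
		return nil
	}
	return w.Mutate(msg, outcome)
}

// MaxRelativeDiff returns the largest relative difference between the two
// counter sets across every outcome seen by either system (0 means the
// counters match exactly, 1 means an outcome is missing on one side).
func MaxRelativeDiff(legacy, shadow map[string]int) float64 {
	seen := make(map[string]bool)
	for k := range legacy {
		seen[k] = true
	}
	for k := range shadow {
		seen[k] = true
	}
	maxDiff := 0.0
	for k := range seen {
		l, s := float64(legacy[k]), float64(shadow[k])
		denom := math.Max(l, s)
		if denom == 0 {
			continue
		}
		if d := math.Abs(l-s) / denom; d > maxDiff {
			maxDiff = d
		}
	}
	return maxDiff
}
```

A parity gate could then be as simple as requiring `MaxRelativeDiff` to stay under an agreed threshold over a full day of mirrored traffic before moving on to the next stage.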

Migration Stage 4: Gradual Customer Migration

[Figure: Stage 4, gradual customer migration]

Once we had reasonable confidence that the new architecture was stable, we started gradually migrating customers over to the new architecture using feature flags. This process was spread out over a few days, with a small percentage being migrated each time and closely monitored.
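A percentage-based rollout of this kind is often implemented by hashing a stable customer identifier into a bucket. The sketch below assumes that approach; the post only says feature flags were used, so the hashing scheme and the names `bucket` and `UseNewArchitecture` are illustrative assumptions rather than the actual mechanism.

```go
package main

import "hash/fnv"

// bucket maps a customer ID to a stable value in [0, 100).
// FNV-1a is an arbitrary choice for this sketch; any stable hash works.
func bucket(customerID string) uint32 {
	h := fnv.New32a()
	_, _ = h.Write([]byte(customerID)) // Write on an FNV hash never returns an error
	return h.Sum32() % 100
}

// UseNewArchitecture reports whether a customer is routed to the new
// system at the given rollout percentage (0 to 100). Because the bucket is
// stable, a customer migrated at 10% stays migrated at 25%, 50%, and so on.
func UseNewArchitecture(customerID string, rolloutPercent uint32) bool {
	return bucket(customerID) < rolloutPercent
}
```

Raising `rolloutPercent` in small steps over a few days only ever adds customers to the new path, and setting it back to 0 returns everyone to the old one without a deploy.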

Migration Stage 5: Complete Migration and Deprecate Old Flow

[Figure: Stage 5, complete migration]

Finally, using a global feature flag, we completed the migration process to the new architecture. We stopped notifications from flowing toward the old service so we could mark it for deprecation.

One thing to note is that at each migration stage, key components were feature flagged so that we could make the switch back at any time to the old stable architecture without requiring a code push or service deployment. We also had a replay mechanism in place that allowed us to replay traffic across a time window (on either system) in the event that either service went down during the migration process.
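The replay mechanism can be pictured as re-delivering stored events that fall inside a time window. The following is a minimal sketch assuming events are kept in a log with a receive timestamp; `Event`, `Replay`, and the half-open window convention are assumptions for illustration, and the handler is assumed to be idempotent so that re-processing is safe.

```go
package main

import (
	"sort"
	"time"
)

// Event is a hypothetical stored notification.
type Event struct {
	ID         string
	ReceivedAt time.Time
}

// Replay re-delivers every event with from <= ReceivedAt < to, oldest
// first, to handle. It stops at the first handler error and returns the
// number of events delivered successfully before that point.
func Replay(log []Event, from, to time.Time, handle func(Event) error) (int, error) {
	window := make([]Event, 0, len(log))
	for _, e := range log {
		if !e.ReceivedAt.Before(from) && e.ReceivedAt.Before(to) {
			window = append(window, e)
		}
	}
	// Sort a copy so the caller's log is left untouched.
	sort.SliceStable(window, func(i, j int) bool {
		return window[i].ReceivedAt.Before(window[j].ReceivedAt)
	})
	for i, e := range window {
		if err := handle(e); err != nil {
			return i, err
		}
	}
	return len(window), nil
}
```

Because `handle` is just a function, the same replay could target either system, which matches the idea of recovering traffic on whichever side went down.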

The five-stage process and the above factors all combined to help us smoothly migrate this highly utilized service without any downtime!

The Benefits

“Mom, I made it!”

With the completion of this project and a smooth rollout, we instantly started reaping the benefits of our shiny new architecture:

Deployment time was cut by more than half as we now had a smaller build image, fewer tasks to deploy, and faster build times thanks to Golang.
Processing capacity increased by 100x compared to the previous architecture. This allowed us to remove the need for business logic in a previous step to filter requests and comfortably increase processing by 20x. Processing volume jumped from 200 qps to 1,000 qps without breaking a sweat.
P99 of dwell time of a request in queue was reduced from minutes (and at times hours) to less than one minute.
Reduced load/latency on core system upstream. Message processing time was reduced by an average of 16%.
Costs stayed the same despite increased processing volume.
No more outages.
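For readers who want to track a similar dwell-time figure, a P99 can be computed from recorded queue dwell times with the nearest-rank method. This is a generic sketch, not a description of how the numbers above were measured; `PercentileMillis` is a hypothetical helper and the choice of nearest-rank is an assumption.

```go
package main

import "sort"

// PercentileMillis returns the p-th percentile (1 to 100) of dwell times
// given in milliseconds, using the nearest-rank method. It returns 0 for
// empty input or an out-of-range p, and does not modify the input slice.
func PercentileMillis(samplesMs []int64, p int) int64 {
	if len(samplesMs) == 0 || p < 1 || p > 100 {
		return 0
	}
	sorted := append([]int64(nil), samplesMs...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// Integer form of ceil(p * n / 100), which avoids floating-point rounding.
	rank := (p*len(sorted) + 99) / 100
	return sorted[rank-1]
}
```

With 100 samples, `PercentileMillis(samples, 99)` picks the 99th smallest value, so a single extreme outlier does not move the reported number.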

Lessons and Key Takeaways

“Learn from the mistakes of those who have walked before you.”

We learned a ton of lessons from this self-initiated project, which benefited not just the company but also us as engineers. Here are the key takeaways:

Understand your limitations.

Depending on your product requirements, optimize for velocity of delivery or scale accordingly. Additionally, know your limitations and understand when you might need a redesign. A better understanding of system limitations will help you decide between a refactor versus a rewrite.

Don’t be afraid to throw away code.

While designs often look good on paper, you can only know for sure what will work after implementing and testing out a PoC. Do not leave things to chance.

Know your infrastructure.

True costs come from badly managed infrastructure. Know your limits and your usage. Could the architecture be revamped to use resources more efficiently? What is the growth runway of your system, and will you be in dire need of a refactor soon?

We would once more like to thank Praveen Bathala for giving us advice and guidance for this project, without which this would not have been possible.

As a fast-growing company, we’ve got lots of interesting engineering challenges to solve—not just the quick way, but the right way. If these problems interest you, and you would like to further your growth as an engineer, we’re hiring! Learn more at our careers website.
