Abstract Violet Grid

Rearchitecting a System: Performing a Migration on a Production System With Zero Downtime

We recently shared a look at how the Abnormal engineering team overhauled our Unwanted Mail service architecture to accommodate our rapid growth. Today, we’re diving into how the team migrated traffic to the new architecture—with zero downtime.

September 12, 2022
“This will be easy, they said.”

Any engineer with experience in migrating from one system to another will tell you the process is anything but straightforward.

In a recent blog post, we shared how rapid growth in our customer base had led to an exponential increase in the email volume being processed by our Unwanted Mail service. We explained the challenges of scaling up to accommodate this growth as well as the shortcomings of the existing Python and Celery architecture for increasing our traffic capacity by a hundredfold.

We also provided a detailed analysis of the engineering problems we were facing, and how we explored and determined that a system built on Golang and Kafka would provide a much more reliable platform that will be more performant at scale.

In this second post, we’ll take a deep dive into how we performed the migration of traffic from the legacy system and ensured a smooth rollout process with no customer downtime. We will also take some time to quantify the improvements this migration had on our product.

The Game Plan

“Failing to plan is planning to fail.”

With a new architecture planned, we set out to isolate the portions that needed to be rewritten as well as plan a smooth rollout process to minimize (or eliminate) downtime. A proper plan was required because there were still ongoing feature requests and limited resources available. Given that the cons were mostly in development time, we decided to bite the bullet and dived into a rewrite.

Migrating From One Solution to the Next

No rewrite is perfect and expecting a perfect rewrite is unrealistic. We knew we needed a good migration process in place to ensure each key component was thoroughly tested and then rolled out with minimal downtime.

In the graphics below we share a high-level overview of our five-stage rollout process, which we hope will provide a reference point for eager learners.

Migration Stage 1: Mirroring Volume

Rearchitecting a System 1 Mirroring Volume

To ensure our Kafka setup was correct, we first mirrored the volume on existing architecture onto Kafka to ensure we provisioned infrastructure limits correctly. We then used this opportunity to address configuration concerns and data issues.

Migration Stage 2: Mirroring Requests

Rearchitecting a System 2 Mirroring Requests

Next, we mirrored requests onto a staging environment with our Golang workers implementation. This allowed us to ensure that the services could be built properly and also gave us a safe environment to test out bugs and use cases extensively.

Migration Stage 3: Comparing Metrics for Feature Parity

Rearchitecting a System 3 Comparing Metrics

Once the previous two stages were complete, we mirrored these requests onto a production instance using the new architecture, disabled actions that mutate data, and recorded metrics. By comparing these metrics with our current service we could make an assessment about feature parity.

Migration Stage 4: Gradual Customer Migration

Rearchitecting a System 4 Gradual Customer Migration

Once we had reasonable confidence that the new architecture was stable, we started gradually migrating customers over to the new architecture using feature flags. This process was spread out over a few days, with a small percentage being migrated each time and closely monitored.

Migration Stage 5: Complete Migration and Deprecate Old Flow

Rearchitecting a System 5 Complete Migration

Finally, using a global feature flag, we completed the migration process to the new architecture. We stopped notifications from flowing toward the old service so we could mark it for deprecation.

One thing to note is that at each migration stage, key components were feature flagged so that we could make the switch back at any time to the old stable architecture without requiring a code push or service deployment. We also had a replay mechanism in place that allowed us to replay traffic across a time window (on either system) in the event that either service went down during the migration process.

The five-stage process and the above factors all combined to help us smoothly migrate this highly utilized service without any downtime!

The Benefits

“Mom, I made it!”

With the completion of this project and a smooth rollout, we instantly started reaping the benefits of our shiny new architecture:

  • Deployment time was cut by more than half as we now had a smaller build image, fewer tasks to deploy, and faster build times thanks to Golang.
  • Processing capacity increased by 100x compared to the previous architecture. This allowed us to remove the need for business logic in a previous step to filter requests and comfortably increase processing by 20x. Processing volume jumped from 200qps to 1000qps without breaking a sweat.
  • P99 of dwell time of a request in queue was reduced from minutes (and at times hours) to less than one minute.
  • Reduced load/latency on core system upstream. Message processing time was reduced by an average of 16%.
  • Costs stayed the same despite increased processing volume.
  • No more outages.

Lessons and Key Takeaways

“Learn from the mistakes of those who have walked before you.”

There are a ton of lessons that we learned from this self-initiated project that not just benefited the company, but us as engineers as well. Here are the key takeaways:

Understand your limitations.

  • Depending on your product requirements, optimize for velocity of delivery or scale accordingly. Additionally, know your limitations and understand when you might need a redesign. A better understanding of system limitations will help you decide between a refactor versus a rewrite.

Don’t be afraid to throw away code.

  • While designs often look good on paper, you can only know for sure what will work after implementing and testing out a PoC. Do not leave things to chance.

Know your infrastructure.

  • True costs come from badly managed infrastructure. Know your limits and your usages. Could the architecture be revamped to use resources more efficiently? What is the growth runway of your system, and will you be in dire need of a refactor soon?

We would once more like to thank Praveen Bathala for giving us advice and guidance for this project, without which this would not have been possible.

As a fast-growing company, we’ve got lots of interesting engineering challenges to solve—not just the quick way, but the right way. If these problems interest you, and you would like to further your growth as an engineer, we’re hiring! Learn more at our careers website.

Image

Prevent the Attacks That Matter Most

Get the Latest Email Security Insights

Subscribe to our newsletter to receive updates on the latest attacks and new trends in the email threat landscape.

0
Demo 2x 1

See the Abnormal Solution to the Email Security Problem

Protect your organization from the attacks that matter most with Abnormal Integrated Cloud Email Security.

Related Posts

B 09 29 22 CISO Cybersecurity Awareness Month
October is here, which means Cybersecurity Awareness Month is officially in full swing! These five tips can help security leaders take full advantage of the month.
Read More
B Email Security Challenges Blog 09 26 22
Understanding common email security challenges caused by your legacy technology will help you determine the best solution to improve your security posture.
Read More
B 5 Crucial Tips
Retailers are a popular target for threat actors due to their wealth of customer data and availability of funds. Here are 5 cybersecurity tips to help retailers reduce their risk of attack.
Read More
B 3 Essential Elements
Legacy approaches to managing unwanted mail are neither practical nor scalable. Learn the 3 essential elements of modern, effective graymail management.
Read More
B Back to School
Discover how threat group Chiffon Herring leverages impersonation and spoofed email addresses to divert paychecks to mule accounts.
Read More
B 09 06 22 Rearchitecting a System Blog
We recently shared a look at how the Abnormal engineering team overhauled our Unwanted Mail service architecture to accommodate our rapid growth. Today, we’re diving into how the team migrated traffic to the new architecture—with zero downtime.
Read More
B Industry Leading CIS Os
Stay up to date on the latest cybersecurity trends, industry news, and best practices by following these 12 innovative and influential thought leaders on social media.
Read More
B Podcast Engineering 11 08 24 22
In episode 11 of Abnormal Engineering Stories, David Hagar, Director of Engineering and Abnormal Head of UK Engineering, continues his conversation with Zehan Wang, co-founder of Magic Pony.
Read More
B Overhauled Architecture Blog 08 29 22
As our customer base has expanded, so has the volume of emails our system processes. Here’s how we overcame scaling challenges with one service in particular.
Read More