How to perform database migration for a live service with no downtime





Migrating a database while continuing to serve user requests can be challenging. How to do it is a question that many students have asked during the Production-Ready Serverless workshop.

So here’s my tried-and-tested approach to migrating a live service to a new database without downtime. I’m going to use DynamoDB as an example, but the approach should work with most other databases.

Can you keep it simple?

Before we dive into it, I want to remind you to keep things simple whenever you can. If the database migration can be completed within a reasonable timeframe, then consider doing it over a small maintenance window.

This is often not possible for large applications with a global user base. Or maybe you’re working in a microservices environment where downtime for a single service can impact many others.

However, it might be a good option for smaller applications or applications with a regional user base.

Ok, with that said, let’s go.

Step 1 — redirect writes to the new database

First, make sure all inserts and updates go to the new database.

Step 2 — use the old database as fallback

Use the old database as a fallback for read operations. If the requested data is not in the new database, fetch it from the old database and save it into the new database.

This is similar to a read-through cache.

Implementing these two steps will deal with the active data that users are interacting with.
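As a rough sketch of Steps 1 and 2, here is how the data access layer might look. The two dictionaries stand in for the old and new databases, and the function names (`save_item`, `get_item`) are made up for illustration; a real implementation would call your database client instead.

```python
from typing import Dict, Optional

# In-memory stand-ins for the two databases (assumption: keyed by item ID).
old_db: Dict[str, dict] = {"user-1": {"id": "user-1", "name": "Alice"}}
new_db: Dict[str, dict] = {}


def save_item(item: dict) -> None:
    """Step 1: every insert and update goes to the new database only."""
    new_db[item["id"]] = item


def get_item(item_id: str) -> Optional[dict]:
    """Step 2: read from the new database, falling back to the old one."""
    item = new_db.get(item_id)
    if item is not None:
        return item

    item = old_db.get(item_id)
    if item is None:
        return None

    # Copy on read, like a read-through cache. setdefault only writes
    # if nothing is stored under this key yet.
    return new_db.setdefault(item_id, item)
```

After the first `get_item("user-1")` call, the item lives in the new database and later reads no longer touch the old one.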

Step 3 — run a script to migrate inactive data

Run a background script to migrate all data to the new database.

You should start the background script AFTER the application has been updated to perform Steps 1 & 2 above. Once the application has been updated, it will write the active data into the new database.

We need to make sure the script doesn’t overwrite newer versions of the data we’re migrating.

Assuming the new database is a DynamoDB table, we can use conditional puts. Use the attribute_not_exists function in the condition expression, applied to the table’s key attribute, so that the put only succeeds if the item doesn’t already exist in the table.
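To illustrate the idea, here is a minimal sketch of the background script with in-memory dictionaries standing in for both databases. `put_if_absent` is a hypothetical helper that mimics a conditional put. With DynamoDB, the equivalent would be a `PutItem` call whose `ConditionExpression` uses `attribute_not_exists` on the partition key, which fails with a `ConditionalCheckFailedException` when the item is already there.

```python
from typing import Dict


def put_if_absent(db: Dict[str, dict], key: str, item: dict) -> bool:
    """Mimic a conditional put: write only if the key is not present.

    Returns True if the item was written, False if it was skipped.
    """
    if key in db:
        return False
    db[key] = item
    return True


def backfill(old_db: Dict[str, dict], new_db: Dict[str, dict]) -> int:
    """Copy every item from old to new without overwriting newer data."""
    migrated = 0
    for key, item in old_db.items():
        if put_if_absent(new_db, key, item):
            migrated += 1
    return migrated


old_db = {"a": {"id": "a", "v": 1}, "b": {"id": "b", "v": 1}}
new_db = {"a": {"id": "a", "v": 2}}  # "a" was already updated by the application
migrated = backfill(old_db, new_db)
```

Only item "b" is copied; the newer version of "a" is left alone. Note that the check-then-write in this sketch is not atomic, whereas a real conditional write is evaluated atomically by the database, which is what makes it safe alongside live traffic.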

Dealing with deletes

But what about deletes?

This sequence of events will be problematic:

  1. The background script reads data from the old database.
  2. The application receives a request to delete the data. The data doesn’t exist in the new database.
  3. The application deletes the data from the old database.
  4. The background script writes the data into the new database.

Oops, we just added a piece of deleted data back into the system!

Thank you, race condition…

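To handle this scenario, the application can write a tombstone record in the new database when it processes a delete. Because a record now exists for that key, the background script’s conditional put fails, and the deleted data is not written back into the system.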

However, it might require a behaviour change in the application so that read operations treat tombstone records as deleted data. Luckily, it doesn’t have to be forever.
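Here is one way the tombstone handling could look, again using dictionaries in place of the databases. The `_tombstone` marker attribute and the function names are assumptions for this sketch, not a DynamoDB feature.

```python
from typing import Dict, Optional

old_db: Dict[str, dict] = {"order-1": {"id": "order-1", "total": 42}}
new_db: Dict[str, dict] = {}


def delete_item(item_id: str) -> None:
    """Leave a tombstone in the new database, then delete from the old one."""
    new_db[item_id] = {"_tombstone": True}  # hypothetical marker record
    old_db.pop(item_id, None)


def get_item(item_id: str) -> Optional[dict]:
    """Read path: a tombstone means the item has been deleted."""
    item = new_db.get(item_id)
    if item is not None:
        return None if item.get("_tombstone") else item
    return old_db.get(item_id)


def backfill_one(item_id: str, item: dict) -> bool:
    """Conditional put used by the background script."""
    if item_id in new_db:  # a live item or a tombstone already exists
        return False
    new_db[item_id] = item
    return True


# Replay the problematic sequence of events described earlier.
snapshot = old_db["order-1"]                 # 1. script reads the old database
delete_item("order-1")                       # 2-3. application handles the delete
written = backfill_one("order-1", snapshot)  # 4. script tries to write
```

In this replay the script's write is rejected because the tombstone occupies the key, and `get_item("order-1")` returns `None` as the caller would expect for deleted data.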

Tombstones are only necessary during the migration process. Once the background script has finished, you can clean things up by:

  1. Running another script against the new database to delete all tombstones.
  2. Updating the application to remove the code that handles tombstones (in read operations).
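The first clean-up step could be as simple as the sketch below. It assumes tombstones carry a `_tombstone` marker attribute (a made-up convention for this example) and uses a dictionary in place of the new database. Against a real table you would scan for the marker and delete the matching items in batches.

```python
from typing import Dict


def purge_tombstones(db: Dict[str, dict]) -> int:
    """Delete every tombstone record and return how many were removed."""
    # Collect the keys first so the dict is not mutated while iterating.
    doomed = [key for key, item in db.items() if item.get("_tombstone")]
    for key in doomed:
        del db[key]
    return len(doomed)


new_db = {
    "order-1": {"_tombstone": True},
    "order-2": {"id": "order-2", "total": 10},
}
removed = purge_tombstones(new_db)
```

Run this only after the background migration has completed; removing tombstones earlier would reopen the race condition they were added to prevent.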

Wrap up

This is my simple, 3-step process to migrate a live service to a new database. As mentioned at the start of this post, it should apply to most database systems. For this process to work, your new database needs to support some form of conditional write operation.

If you want to learn more about building production-ready serverless applications, then why not check out my next workshop?

The next cohort starts on January 8th, so there is still time to sign up and level up your serverless game in 2024!


