How to perform database migration for a live service with no downtime

Migrating the database while continuing to serve user requests can be challenging. It’s a question that many students have asked during the Production-Ready Serverless workshop.

So here’s my tried-and-tested approach to migrating a live service to a new database without downtime. I’m going to use DynamoDB as an example but it should work with most other databases.

Can you keep it simple?

Before we dive into it, I want to remind you to keep things simple whenever you can. If the database migration can be completed within a reasonable timeframe, then consider doing it over a small maintenance window.

This is often not possible for large applications with a global user base. Or maybe you’re working in a microservices environment where downtime for a single service can impact many others.

However, it might be a good option for smaller applications or applications with a regional user base.

Ok, with that said, let’s go.

Step 1 — redirect writes to the new database

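First, make sure all inserts and updates go to the new database. If an update modifies an item that only exists in the old database, copy the item into the new database first and then apply the update there.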

Step 2 — use the old database as fallback

Use the old database as a fallback for read operations. If the intended data is not available in the new database then fetch it from the old database and save it into the new database.

This is similar to a read-through cache.

Implementing these two steps will deal with the active data that users are interacting with.
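
Here's a minimal sketch of Steps 1 and 2, using two in-memory dictionaries as stand-ins for the old and new databases. The `MigratingRepository` class and its method names are hypothetical and only there to show the flow. In a real system the copy-forward in `get` should be a conditional write so it never replaces a newer value.

```python
from typing import Dict, Optional

Record = Dict[str, str]


class MigratingRepository:
    """Hypothetical data-access layer used while the migration is in progress."""

    def __init__(self, old_db: Dict[str, Record], new_db: Dict[str, Record]) -> None:
        self.old_db = old_db
        self.new_db = new_db

    def save(self, key: str, record: Record) -> None:
        # Step 1: every insert and update goes to the new database only.
        self.new_db[key] = dict(record)

    def get(self, key: str) -> Optional[Record]:
        # Step 2: read from the new database first.
        record = self.new_db.get(key)
        if record is not None:
            return record

        # Fall back to the old database, like a read-through cache.
        record = self.old_db.get(key)
        if record is None:
            return None

        # Copy the record forward without replacing anything that was
        # written to the new database in the meantime.
        return self.new_db.setdefault(key, dict(record))
```

After the first `get` for a key that only exists in the old database, the record lives in the new database and every later read and write for that key stays there.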

Step 3 — run a script to migrate inactive data

Run a background script to migrate all data to the new database.

You should start the background script AFTER the application has been updated to perform Steps 1 & 2 above. Once the application has been updated, it will write the active data into the new database.

We need to make sure the script doesn’t overwrite newer versions of the data we’re migrating.

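Assuming the new database is a DynamoDB table, we need to use conditional puts. Use the attribute_not_exists function in a condition expression on the item’s key attribute, so the put only succeeds if the item doesn’t exist in the DynamoDB table already. If the application has already written the item, the put fails and the script can move on to the next item.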
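
As a rough sketch, this is what the conditional put could look like with a boto3-style `Table` object. The `migrate_item` helper and the `id` key attribute are my assumptions for illustration, so adjust them to your own schema. The code only relies on `put_item` accepting `Item`, `ConditionExpression` and `ExpressionAttributeNames`, and on the failure surfacing as an error whose response code is `ConditionalCheckFailedException`.

```python
from typing import Any, Dict, Mapping


def conditional_put_kwargs(item: Mapping[str, Any], key_attr: str = "id") -> Dict[str, Any]:
    """Build put_item arguments that only succeed if the item is not there yet."""
    return {
        "Item": dict(item),
        # "#k" is a placeholder so the key name never clashes with a reserved word.
        "ConditionExpression": "attribute_not_exists(#k)",
        "ExpressionAttributeNames": {"#k": key_attr},
    }


def is_condition_failure(err: Exception) -> bool:
    """True if the error carries DynamoDB's ConditionalCheckFailedException code."""
    response = getattr(err, "response", None) or {}
    return response.get("Error", {}).get("Code") == "ConditionalCheckFailedException"


def migrate_item(table: Any, item: Mapping[str, Any], key_attr: str = "id") -> bool:
    """Copy one item into the new table. Returns False if a newer copy already exists.

    `table` is assumed to behave like a boto3 DynamoDB Table resource.
    """
    try:
        table.put_item(**conditional_put_kwargs(item, key_attr))
        return True
    except Exception as err:
        if is_condition_failure(err):
            # The application got there first, so keep its version.
            return False
        raise
```

The background script would call `migrate_item` for every item it reads from the old database and treat a `False` result as "already migrated", not as an error.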

Dealing with deletes

But what about deletes?

This sequence of events will be problematic:

  1. The background script reads data from the old database.
  2. The application receives a request to delete the data. The data doesn’t exist in the new database.
  3. The application deletes the data from the old database.
  4. The background script writes the data into the new database.

Oops, we just added a piece of deleted data back into the system!

Thank you, race condition…

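To handle this scenario, we can write a tombstone record (a placeholder that marks the item as deleted) in the new database. Because the background script uses conditional writes, its put fails when a tombstone already exists for that key. This stops the script from writing the deleted data back into the system.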

However, it might require a behaviour change in the application: read operations have to treat a tombstone record as "not found" instead of falling back to the old database. Luckily, it doesn’t have to be forever.
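
To make the idea concrete, here's a small in-memory sketch of tombstones. The `_tombstone` marker and the function names are hypothetical, the dictionaries stand in for the two databases, and the read-through copy from Step 2 is left out to keep it short.

```python
from typing import Any, Dict, Optional

Record = Dict[str, Any]

# Hypothetical marker: any shape works as long as reads can recognise it.
TOMBSTONE: Record = {"_tombstone": True}


def delete(old_db: Dict[str, Record], new_db: Dict[str, Record], key: str) -> None:
    # Write the tombstone first so the key is blocked before the old copy goes away.
    new_db[key] = dict(TOMBSTONE)
    old_db.pop(key, None)


def read(old_db: Dict[str, Record], new_db: Dict[str, Record], key: str) -> Optional[Record]:
    record = new_db.get(key)
    if record is not None:
        # A tombstone means "deleted", so don't fall back to the old database.
        return None if record.get("_tombstone") else record
    return old_db.get(key)


def backfill(new_db: Dict[str, Record], key: str, record: Record) -> bool:
    # Stand-in for the conditional put: only write if the key is absent.
    if key in new_db:
        return False
    new_db[key] = dict(record)
    return True
```

Replaying the race from above: the script reads the item, the application calls `delete`, and the script's `backfill` then returns `False` because the tombstone already occupies the key.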

Tombstones are only necessary during the migration process. Once the background script has finished, you can clean things up by:

  1. Running another script against the new database to delete all tombstones.
  2. Updating the application to remove the code that handles tombstones (in read operations).
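
The clean-up script could look something like this sketch. It assumes a boto3-style `Table` object, an `id` key attribute and a boolean `tombstone` attribute, all of which are my placeholders. It pages through `scan` results using `LastEvaluatedKey` and deletes conditionally, so an item that was re-created after the scan is left alone.

```python
from typing import Any, Dict


def purge_tombstones(table: Any, key_attr: str = "id", flag_attr: str = "tombstone") -> int:
    """Delete every tombstone from the new table and return how many were removed."""
    names = {"#flag": flag_attr}
    values = {":yes": True}
    scan_kwargs: Dict[str, Any] = {
        "FilterExpression": "#flag = :yes",
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }
    deleted = 0
    while True:
        page = table.scan(**scan_kwargs)
        for item in page.get("Items", []):
            try:
                # Only delete if the item is still a tombstone.
                table.delete_item(
                    Key={key_attr: item[key_attr]},
                    ConditionExpression="#flag = :yes",
                    ExpressionAttributeNames=names,
                    ExpressionAttributeValues=values,
                )
                deleted += 1
            except Exception as err:
                response = getattr(err, "response", None) or {}
                code = response.get("Error", {}).get("Code")
                if code != "ConditionalCheckFailedException":
                    raise
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return deleted
        # Continue the scan from where the previous page stopped.
        scan_kwargs["ExclusiveStartKey"] = last_key
```

A scan reads the whole table, so for a large table it's worth running this during a quiet period.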

Wrap up

This is my simple, 3-step process to migrate a live service to a new database. As mentioned at the start of this post, it should apply to most database systems. For this process to work, your new database needs to support some form of conditional write operation.

If you want to learn more about building production-ready serverless applications, then why not check out my next workshop?

The next cohort starts on January 8th, so there is still time to sign up and level up your serverless game in 2024!