Migrating a database while continuing to serve user requests can be challenging. It’s a question that many students have asked during the Production-Ready Serverless workshop.
So here’s my tried-and-tested approach to migrating a live service to a new database without downtime. I’m going to use DynamoDB as an example, but the approach should work with most other databases.
Before we dive into it, I want to remind you to keep things simple whenever you can. If the database migration can be completed within a reasonable timeframe, then consider doing it over a small maintenance window.
This is often not possible for large applications with a global user base. Or maybe you’re working in a microservices environment where downtime for a single service can impact many others.
However, it might be a good option for smaller applications or applications with a regional user base.
Ok, with that said, let’s go.
Step 1. Make sure all inserts and updates go to the new database.
Step 2. Use the old database as a fallback for read operations. If the data is not available in the new database, then fetch it from the old database and save it into the new database.
This is similar to a read-through cache.
Implementing these two steps will deal with the active data that users are interacting with.
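To make this concrete, here’s a minimal sketch of Steps 1 & 2 in TypeScript with the AWS SDK v3. The table names (users-old and users-new) and the id key are hypothetical placeholders, and the plain put in the fallback path keeps things simple; you could make it conditional too if concurrent writes are a concern.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import {
  DynamoDBDocumentClient,
  GetCommand,
  PutCommand,
} from "@aws-sdk/lib-dynamodb";

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Step 1: all inserts and updates go to the new table.
async function saveUser(user: { id: string }) {
  await docClient.send(
    new PutCommand({ TableName: "users-new", Item: user })
  );
}

// Step 2: read from the new table first, fall back to the old table
// and copy the item forward - like a read-through cache.
async function getUser(id: string) {
  const fromNew = await docClient.send(
    new GetCommand({ TableName: "users-new", Key: { id } })
  );
  if (fromNew.Item) {
    return fromNew.Item;
  }

  const fromOld = await docClient.send(
    new GetCommand({ TableName: "users-old", Key: { id } })
  );
  if (fromOld.Item) {
    // save into the new table so subsequent reads find it there
    await docClient.send(
      new PutCommand({ TableName: "users-new", Item: fromOld.Item })
    );
  }
  return fromOld.Item;
}
```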
Step 3. Run a background script to migrate all data to the new database.
You should start the background script AFTER the application has been updated to perform Steps 1 & 2 above. Once the application has been updated, it will write the active data into the new database.
We need to make sure the script doesn’t overwrite newer versions of the data we’re migrating.
Assuming the new database is a DynamoDB table, we need to use conditional puts. Use the attribute_not_exists conditional function to ensure the item doesn’t already exist in the DynamoDB table.
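For example, the background script’s put might look like this. It’s a sketch that reuses the hypothetical users-new table and id key from above; ConditionalCheckFailedException is part of the AWS SDK v3.

```typescript
import {
  DynamoDBClient,
  ConditionalCheckFailedException,
} from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Migrate a single item with a conditional put. If the application has
// already written a (newer) version of the item, the condition fails and
// we skip the item instead of overwriting it.
async function migrateItem(item: { id: string }) {
  try {
    await docClient.send(
      new PutCommand({
        TableName: "users-new",
        Item: item,
        ConditionExpression: "attribute_not_exists(id)",
      })
    );
  } catch (err) {
    if (err instanceof ConditionalCheckFailedException) {
      return; // a newer version already exists in the new table, skip it
    }
    throw err;
  }
}
```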
But what about deletes?
This sequence of events will be problematic:

1. The background script reads an item from the old database.
2. A user deletes that item, and the application removes it from the new database.
3. The background script writes the item to the new database. The conditional put succeeds because the item no longer exists there.
Oops, we just added a piece of deleted data back into the system!
Thank you, race condition…
To handle this scenario, we can write a tombstone record to the new database when data is deleted. The tombstone stops the background script from writing the deleted data back into the system.
However, it might require a behaviour change in the application: read operations need to treat tombstoned records as deleted. Luckily, it doesn’t have to be forever.
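Here’s a sketch of what that could look like, using a hypothetical _deleted marker attribute on the same users-new table.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import {
  DynamoDBDocumentClient,
  GetCommand,
  PutCommand,
} from "@aws-sdk/lib-dynamodb";

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Instead of deleting the item, write a tombstone. Because the item still
// exists in the table, the background script's attribute_not_exists
// condition fails and the deleted data can't be resurrected.
async function deleteUser(id: string) {
  await docClient.send(
    new PutCommand({
      TableName: "users-new",
      Item: { id, _deleted: true },
    })
  );
}

// Read operations treat tombstoned items as "not found". Crucially, they
// must NOT fall back to the old database in this case, or the deleted
// data would reappear.
async function getUser(id: string) {
  const res = await docClient.send(
    new GetCommand({ TableName: "users-new", Key: { id } })
  );
  if (!res.Item || res.Item._deleted) {
    return undefined;
  }
  return res.Item;
}
```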
Tombstones are necessary during the migration process. But once the background script has finished, you can clean things up by:

1. Deleting the tombstone records from the new database.
2. Removing the tombstone-handling logic (and the fallback reads against the old database) from the application.
This is my simple, 3-step process to migrate a live service to a new database. As mentioned at the start of this post, it should apply to most database systems. For this process to work, your new database needs to support some form of conditional write operation.
If you want to learn more about building production-ready serverless applications, then why not check out my next workshop?
The next cohort starts on January 8th, so there is still time to sign up and level up your serverless game in 2024!