It wasn't DNS after all


The DNS failure was the first symptom of the recent AWS outage, not its root cause.

The root cause was a race condition in an internal DynamoDB microservice that automates DNS record management for the regional cells of DynamoDB.

Like many AWS services, DynamoDB has a cell-based architecture.


(see my conversation with Khawaja Shams, who used to lead the DynamoDB team, on this topic)

Every cell has an automated system that keeps its DNS entries in sync.

That automation system has two main components:

  • a DNS Planner, which generates a plan for how the DNS records should look.
  • DNS Enactors, which apply those plans in Route 53.

The race condition happened between the DNS Enactors, and it ultimately left the regional endpoint with an empty DNS record, rendering the DynamoDB service inaccessible.
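The post mortem doesn't include the enactor code, but the shape of the bug is the classic stale-write race: a delayed enactor applies an old plan over a newer one. Here's a minimal Python sketch of that failure class, and of the kind of version guard that rejects stale writes. All names are hypothetical, this is an illustration of the failure class, not AWS's actual implementation.

```python
# A minimal sketch of the stale-write race (all names hypothetical).


class Route53Stub:
    """Stands in for the DNS record set of a regional endpoint."""

    def __init__(self):
        self.applied_version = 0
        self.records = []

    def apply_unguarded(self, plan):
        # Last writer wins: a delayed enactor holding an old plan
        # silently overwrites a newer one.
        self.applied_version = plan["version"]
        self.records = plan["records"]

    def apply_guarded(self, plan):
        # Refuse to apply a plan that is older than what is already live.
        if plan["version"] <= self.applied_version:
            raise ValueError(f"stale plan v{plan['version']} rejected")
        self.applied_version = plan["version"]
        self.records = plan["records"]


endpoint = Route53Stub()
new_plan = {"version": 42, "records": ["10.0.0.1", "10.0.0.2"]}
stale_plan = {"version": 41, "records": ["10.0.0.9"]}

endpoint.apply_unguarded(new_plan)
endpoint.apply_unguarded(stale_plan)  # delayed enactor wins the race
print(endpoint.records)               # ['10.0.0.9'] -- stale records are now live

endpoint.apply_guarded(new_plan)      # a version check would have rejected the stale plan
```

In the real incident the end state was worse than stale records: per the post mortem, the stale apply interacted with the plan clean-up process in a way that left the regional endpoint with an empty record set.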

Because EC2's control plane uses DynamoDB to manage distributed locks and leases (to avoid race conditions), the outage to DynamoDB meant EC2 couldn't launch new instances.
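The post mortem doesn't show EC2's internals, but the pattern it relies on is worth spelling out: a lease is just a conditional write that only succeeds if nobody else holds a live lease. Here's a minimal boto3 sketch of that pattern; the table name and attribute names are hypothetical, not EC2's actual implementation.

```python
# A lease-style lock built on DynamoDB conditional writes (names hypothetical).
import time
import uuid

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

LOCK_TABLE = "control-plane-locks"  # hypothetical table
LEASE_SECONDS = 30


def acquire_lease(resource_id: str) -> str | None:
    """Try to acquire a lease on resource_id. Returns an owner token on success."""
    owner = str(uuid.uuid4())
    now = int(time.time())
    try:
        dynamodb.put_item(
            TableName=LOCK_TABLE,
            Item={
                "resource_id": {"S": resource_id},
                "owner": {"S": owner},
                "expires_at": {"N": str(now + LEASE_SECONDS)},
            },
            # Succeed only if nobody holds the lock, or the previous lease expired.
            ConditionExpression="attribute_not_exists(resource_id) OR expires_at < :now",
            ExpressionAttributeValues={":now": {"N": str(now)}},
        )
        return owner
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return None  # someone else holds a live lease
        raise
```

Renewal and release work the same way, with the condition checking the owner token instead. The important point for this outage: every one of these calls goes to DynamoDB, so when DynamoDB's endpoint can't be resolved, the control plane can't acquire or renew leases and stops making progress.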

New instances were created at the hypervisor level, but their networking configuration never completed.

NLB then marked the newly launched instances as unhealthy because their networking state hadn't completed. This triggered large-scale health check failures, removing valid back-ends from load balancers.

So yeah, DNS was the first symptom of the problem, but it wasn't the root cause of the outage. That honour belongs to the race condition in the DNS management system inside DynamoDB!

(or, you can go one "why?" further and attribute the root cause to whatever caused the unusually high delay in the enactors, but that wasn't explained in the post mortem)

You can read the full post mortem here; it's quite long, but worth a read.
