It wasn't DNS after all


The DNS failure was the first symptom, not the root cause of the recent AWS outage.

The root cause was a race condition in an internal DynamoDB microservice that automates DNS record management for the regional cells of DynamoDB.

Like many AWS services, DynamoDB has a cell-based architecture.


(see my conversation with Khawaja Shams, who used to lead the DynamoDB team, on this topic)

Every cell has an automated system that keeps its DNS entries in sync.

That automation system has two main components:

  • a DNS Planner, which generates a plan for how the DNS records should look.
  • DNS Enactors, which apply those plans in Route 53.

The race condition occurred between the DNS Enactors and ultimately left an empty DNS record in place, rendering the DynamoDB service endpoint inaccessible.
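To make the race concrete, here is a minimal sketch of how a stale Enactor plus a cleanup step can leave an empty record. All names (`route53`, `enactor`, `cleanup`, plan versions) are hypothetical simplifications; the actual AWS internals are not public. A slow Enactor applies an old plan after a newer one has landed, and the cleanup of "old" plans then deletes the record that is actually live:

```python
import threading
import time

# Simulated live DNS record and the version of the last-applied plan.
route53 = {"dynamodb.example": None}
applied_plan_version = 0
lock = threading.Lock()

def enactor(plan_version, ips, delay):
    """Apply a DNS plan after `delay` seconds (simulating the
    unusually slow Enactor described in the post mortem)."""
    global applied_plan_version
    time.sleep(delay)
    with lock:
        # BUG: no check that a *newer* plan was already applied,
        # so a stale Enactor silently overwrites fresh records.
        route53["dynamodb.example"] = ips
        applied_plan_version = plan_version

def cleanup(up_to_version):
    """Garbage-collect plans older than `up_to_version`.
    If a stale plan was just re-applied, this wipes the live record."""
    with lock:
        if applied_plan_version < up_to_version:
            route53["dynamodb.example"] = None  # empty record: outage

def scenario():
    slow = threading.Thread(target=enactor, args=(1, ["10.0.0.1"], 0.1))
    fast = threading.Thread(target=enactor, args=(2, ["10.0.0.2"], 0.0))
    slow.start(); fast.start()
    fast.join(); slow.join()      # stale plan 1 lands *after* plan 2
    cleanup(up_to_version=2)      # deletes "old" plan 1, which is live
    return route53["dynamodb.example"]
```

Running `scenario()` returns `None`: the record set ends up empty even though both plans individually were valid. The classic fix is to make the apply step conditional on the plan version only ever moving forward.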

Because EC2's control plane uses DynamoDB to manage distributed locks and leases (to avoid race conditions), the DynamoDB outage meant EC2 couldn't launch new instances.
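The lock-and-lease pattern EC2 relies on is typically built from conditional writes: acquire a lease only if none exists or the existing one has expired. The sketch below is a generic in-memory illustration of that pattern, not EC2's actual implementation; with DynamoDB you would express the same condition via a `ConditionExpression` on `PutItem`:

```python
import time

# In-memory stand-in for a DynamoDB lease table keyed by resource name.
leases = {}

def acquire_lease(resource, owner, ttl_seconds, now=None):
    """Grant `owner` a lease on `resource` iff no unexpired lease exists.
    Mirrors a conditional write: succeed only when the item is absent
    or its expiry has passed; otherwise fail without modifying state."""
    now = time.time() if now is None else now
    current = leases.get(resource)
    if current is None or current["expires"] <= now:
        leases[resource] = {"owner": owner, "expires": now + ttl_seconds}
        return True
    return False
```

For example, `acquire_lease("droplet-42", "worker-a", 30, now=0.0)` succeeds, a competing `acquire_lease("droplet-42", "worker-b", 30, now=1.0)` fails, and `worker-b` succeeds again at `now=31.0` once the lease has expired. The failure mode during the outage follows directly: if the lease table itself is unreachable, no one can acquire or renew leases, so control-plane work stalls.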

New instances were created at the hypervisor level, but their networking configuration never completed.

Then NLB marked the newly launched instances as unhealthy, because their networking configuration had never completed. This triggered large-scale health check failures, removing valid back-ends from load balancers.

So yeah, DNS was the first symptom of the problem, but it wasn't the root cause of the outage. That honour belongs to the race condition in the DNS management system inside DynamoDB!

(Or you can go one "why?" further and attribute the root cause to whatever caused the unusually high delay in the Enactors, but that wasn't explained in the post mortem.)

You can read the full post mortem here; it's quite long, but worth a read.
