It wasn't DNS after all


The DNS failure was the first symptom of the recent AWS outage, not its root cause.

The root cause was a race condition in an internal DynamoDB microservice that automates DNS record management for the regional cells of DynamoDB.

Like many AWS services, DynamoDB has a cell-based architecture.


(see my conversation with Khawaja Shams, who used to lead the DynamoDB team, on this topic)

Every cell has an automated system that keeps its DNS entries in sync.

That automation system has two main components:

  • a DNS Planner, which generates a plan for how the DNS records should look.
  • DNS Enactors, which apply those plans in Route 53.

The race condition happened between the DNS Enactors, and it ultimately left the regional endpoint with an empty DNS record, rendering the DynamoDB service inaccessible.
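The post mortem doesn't include the enactor code, but the shape of the bug is the classic stale-write race: a delayed enactor applies an old plan over a newer one. Here's a minimal Python sketch of that failure class, and of the kind of version guard that rejects stale writes. All names are hypothetical, this is an illustration of the failure class, not AWS's actual implementation.

```python
# A minimal sketch of the stale-write race (all names hypothetical).


class Route53Stub:
    """Stands in for the DNS record set of a regional endpoint."""

    def __init__(self):
        self.applied_version = 0
        self.records = []

    def apply_unguarded(self, plan):
        # Last writer wins: a delayed enactor holding an old plan
        # silently overwrites a newer one.
        self.applied_version = plan["version"]
        self.records = plan["records"]

    def apply_guarded(self, plan):
        # Refuse to apply a plan that is older than what is already live.
        if plan["version"] <= self.applied_version:
            raise ValueError(f"stale plan v{plan['version']} rejected")
        self.applied_version = plan["version"]
        self.records = plan["records"]


endpoint = Route53Stub()
new_plan = {"version": 42, "records": ["10.0.0.1", "10.0.0.2"]}
stale_plan = {"version": 41, "records": ["10.0.0.9"]}

endpoint.apply_unguarded(new_plan)
endpoint.apply_unguarded(stale_plan)  # delayed enactor wins the race
print(endpoint.records)               # ['10.0.0.9'] -- stale records are now live

endpoint.apply_guarded(new_plan)      # a version check would have rejected the stale plan
```

In the real incident the end state was worse than stale records: per the post mortem, the stale apply interacted with the plan clean-up process in a way that left the regional endpoint with an empty record set.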

Because EC2's control plane uses DynamoDB to manage distributed locks and leases (to avoid race conditions), the outage to DynamoDB meant EC2 couldn't launch new instances.
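The post mortem doesn't show EC2's internals, but the pattern it relies on is worth spelling out: a lease is just a conditional write that only succeeds if nobody else holds a live lease. Here's a minimal boto3 sketch of that pattern; the table name and attribute names are hypothetical, not EC2's actual implementation.

```python
# A lease-style lock built on DynamoDB conditional writes (names hypothetical).
import time
import uuid

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

LOCK_TABLE = "control-plane-locks"  # hypothetical table
LEASE_SECONDS = 30


def acquire_lease(resource_id: str) -> str | None:
    """Try to acquire a lease on resource_id. Returns an owner token on success."""
    owner = str(uuid.uuid4())
    now = int(time.time())
    try:
        dynamodb.put_item(
            TableName=LOCK_TABLE,
            Item={
                "resource_id": {"S": resource_id},
                "owner": {"S": owner},
                "expires_at": {"N": str(now + LEASE_SECONDS)},
            },
            # Succeed only if nobody holds the lock, or the previous lease expired.
            ConditionExpression="attribute_not_exists(resource_id) OR expires_at < :now",
            ExpressionAttributeValues={":now": {"N": str(now)}},
        )
        return owner
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return None  # someone else holds a live lease
        raise
```

Renewal and release work the same way, with the condition checking the owner token instead. The important point for this outage: every one of these calls goes to DynamoDB, so when DynamoDB's endpoint can't be resolved, the control plane can't acquire or renew leases and stops making progress.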

New instances were created at the hypervisor level, but their networking configuration never completed.

NLB then marked the newly launched instances as unhealthy because their networking state hadn't completed. This triggered large-scale health check failures, removing valid back-ends from load balancers.

So yeah, DNS was the first symptom of the problem, but it wasn't the root cause of the outage. That honour belongs to the race condition in the DNS management system inside DynamoDB!

(or, you can go one "why?" further and attribute the root cause to whatever caused the unusually high delay in the enactors, but that wasn't explained in the post mortem)

You can read the full post mortem here; it's quite long, but worth a read.
