The upside & downside of DoS'ing yourself in production


Years ago, I worked at a large e-commerce company that was one of the biggest food delivery services in the UK.

They did something very interesting - they regularly ran load tests against production using fake orders.

As an outside observer, here's what I think we can learn from this practice, and how it indirectly contributed to the biggest outage they ever experienced (though not from the load tests themselves!).

Load testing in production

As a food delivery service, they experienced large traffic spikes during lunch and dinner hours. The spike was especially pronounced at dinner time.

As the business continued to grow, they wanted confidence that their infrastructure could keep up with traffic demands.

Chaos engineering was the talk of the town, and the idea of "testing in production" was a particular fascination to the team.

What better way to know that you can ACTUALLY handle a sudden increase in orders than simulating it in production?

As such, they regularly ran load tests in production and pushed the peak load to ~150% of the actual peak production load.

These load tests created fake orders that traversed the entire system, touching every part of it. The fake orders were flagged in the database, excluded from business analytics, and did not affect KPI results.
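To make this concrete, here is a minimal sketch of how such flagging might work. The flag name, schema, and helper functions are all hypothetical - this is not the company's actual design, just an illustration of the pattern:

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    total: float
    is_load_test: bool = False  # hypothetical flag column on the orders table


def create_load_test_order(order_id: str, total: float) -> Order:
    """Create a fake order, flagged so downstream consumers can identify it."""
    return Order(order_id=order_id, total=total, is_load_test=True)


def revenue_for_analytics(orders: list[Order]) -> float:
    """KPI queries exclude flagged orders.

    In SQL this would be something like: WHERE is_load_test = 0
    """
    return sum(o.total for o in orders if not o.is_load_test)
```

The key design choice is that the flag travels with the order through every downstream system, so analytics (and anything else that cares) can filter the fake traffic out.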

The good

These tests told us that the system could handle at least 50% more than the actual peak load. If any scalability issues were identified, the team had time to react.

The tests could be quickly shut down at the first sign of trouble.

Because everything was running on EC2, the load tests also ensured the system was sufficiently scaled to handle much more load than anticipated. If there was an unexpected increase in real traffic, the load tests could be shut down to make room for it.

The team put a lot of care and thought into the execution of the load tests, ensuring they were carried out safely and could be stopped at a moment's notice.

Overall, I thought the load tests were carried out methodically and it was a refreshing approach.

The bad

However, nothing was done to clean up the fake orders from the system. I flagged this as a potential problem because the database had accumulated an enormous number of fake orders.

And it wasn't just the orders.

As the fake orders traversed through the system, they left behind many other data trails in various database tables.

Another important factor to consider is that this was kind of a distributed monolith. They had started down the path towards microservices a few years earlier, but there was still a shared, self-hosted SQL Server database running on EC2.

Many of the microservices had their own databases, but data changes were synced to this shared database because it was still used by other services.

What I didn't know at the time was that they had maxed out the number of EBS volumes they could attach. They had the biggest, fastest EBS volumes money could buy, and they were completely maxed out.

The database was so big they literally reached the vertical scaling limit.

I later learnt that something like 40% of the data volume was attributable to fake orders.

The ugly

The sheer volume of fake orders in the system added stress to the shared SQL Server database. But at least it wasn't the source of truth for many of their critical services, so it wasn't a single point of failure.

Or so I thought.

Two years after I left the company, they had their biggest outage ever, and it lasted several days.

The SQL Server database blew up (figuratively).

I later learnt that the database was used by an internal customer support tool. Someone ran a query so expensive it ground the database to a halt.

Remember, this was a gigantic database. The size of the database was a big contributing factor.

And it turns out many of the microservices that had their own databases still needed data from the SQL Server database to operate. So when that database went down, everything failed.

The aftermath

A lot of changes were instigated after the incident. There were significant changes to the technical leadership, and they finally addressed many of their long-standing technical debts.

The fake orders were removed and they stopped the practice of running load tests in production (AFAIK).

They finally implemented caching in some parts of the system - such as the menu service. Amazingly, they had never cached menu data because the service team was convinced it was more performant and scalable to load the data from the database...

All in all, they had an amazing turnaround in the engineering department.

The funny thing was, during their outage, all their customers went to their closest competitor and took them offline... so in the end, they didn't suffer a huge business loss.

What can we learn from this?

System isolation matters

In a distributed architecture, every service should have its own database. Sharing databases creates a single point of failure. It also creates multiple forms of coupling [1] between systems - temporal, topological, and format coupling.

Uptime cost is a sunk cost

When it comes to relational databases such as RDS or self-hosted SQL Server, there's a sunk cost fallacy at play.

Because you're already paying for the uptime of a beefy database server, it feels more cost-efficient to reuse it. This economic force pushes teams towards shared databases.

And since you're sharing a database already, it's just easier to read another service's data directly instead of going through its API.

Before you know it, you have a distributed monolith with a tangle of implicit coupling between services. A single schema change can break multiple, unrelated services. And one expensive query can kill the whole system.
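One mitigation for the runaway-query problem is to cap how much work any single query is allowed to do. The sketch below uses Python's built-in sqlite3 purely for illustration (on SQL Server the equivalent lever would be a command timeout on the client); the function name and step budget are my own assumptions:

```python
import sqlite3


def run_with_budget(conn: sqlite3.Connection, sql: str, max_checks: int = 50_000):
    """Run a query but abort it once it exceeds a work budget.

    sqlite3 invokes the progress handler periodically while a query runs;
    returning a truthy value from it aborts the query with OperationalError.
    """
    state = {"checks": 0}

    def guard():
        state["checks"] += 1
        return state["checks"] > max_checks  # truthy return aborts the query

    conn.set_progress_handler(guard, 1000)  # call guard every 1000 VM instructions
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 0)  # remove the guard
```

A guard like this doesn't fix the underlying coupling, but it turns "one bad query takes down every service" into "one bad query fails fast".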

Don't overlook internal tools

So many engineering teams have a blind spot when it comes to their internal tools. There have been many examples of internal tools causing global outages (e.g. by pushing out bad server configs) or security breaches.

When you assess the performance and scalability of your system, pay close attention to these internal tools.

Never let a good crisis go to waste

As bad as the outage was, they were able to use it as a catalyst to clean up their act. They cleaned up years of technical debt and rethought key architectural decisions.

Something similar happened at DAZN, where it took a major outage to muster the commitment and drive for a widespread adoption of consumer-driven contract testing.

Nobody likes an outage. But they can be a useful catalyst for good things to follow.

Build for success

Making it in a competitive market is tough. Sometimes your only chance for success is if your competitor slips up.

In this case, my former employer's competitor missed their shot and was later acquired (by my former employer). There's no room for a distant second in the food delivery business.

This reminds me of a hard lesson that I had to learn earlier in my career: you have to build for success.

Your system needs to be both cost-efficient at low scale and able to scale up quickly when it needs to. Because success can come from a single tweet by a celebrity endorser, or an article, or a competitor having a bad day.

You need to be ready for it!

In many ways, this is exactly why I'm so enamoured with serverless technologies like Lambda.

You can build systems that cost pennies to run [2] when you are starting out. But when success comes, the system can instantly scale to thousands of requests per second [3] without breaking a sweat!

Yes, the cost will go up in the short term, but so will your revenue. At least now you have a chance at sustaining that initial taste of success and hopefully seeing years of hard work pay off.

Links

[1] Many facets of coupling

[2] He built a hotel booking system that costs $0.82/month to run

[3] Lambda is more scalable than you think

Master Serverless

Join 17K readers and level up your AWS game with just 5 mins a week.
