The upside & downside of DoS'ing yourself in production


Years ago, I worked at a large e-commerce company that was one of the biggest food delivery services in the UK.

They did something very interesting - they regularly ran load tests against production using fake orders.

As a partial observer, here's what I think we can learn from this practice, and how it contributed to the biggest outage they ever experienced (though not from the load tests themselves!).

Load testing in production

As a food delivery service, they experienced large traffic spikes during lunch and dinner hours. The spike was especially pronounced at dinner time.

As the business continued to grow, they wanted confidence that their infrastructure could keep up with traffic demands.

Chaos engineering was the talk of the town, and the idea of "testing in production" held a particular fascination for the team.

What better way to know that you can ACTUALLY handle a sudden increase in orders than simulating it in production?

As such, they regularly ran load tests in production, pushing the peak load to ~150% of the actual production peak.

These load tests created fake orders that traversed the entire system and touched every part of it. The fake orders were flagged in the database, excluded from business analytics, and did not affect KPI results.
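
I didn't dig into the exact mechanism, but flagging test data at the point of creation and filtering it out of analytics is straightforward. Here's a minimal sketch, assuming a hypothetical is_test_order column on the orders table and pyodbc for database access:

```python
import pyodbc

# Hypothetical connection string; the real system used a self-hosted SQL Server on EC2.
CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=shared-db.internal;DATABASE=orders;Trusted_Connection=yes;"
)

def create_order(cursor: pyodbc.Cursor, customer_id: int, total: float, is_test: bool = False) -> None:
    # Every order carries a flag, so downstream consumers can tell real traffic
    # apart from load-test traffic.
    cursor.execute(
        "INSERT INTO orders (customer_id, total, is_test_order) VALUES (?, ?, ?)",
        customer_id, total, int(is_test),
    )

def daily_revenue(cursor: pyodbc.Cursor, day: str) -> float:
    # Business analytics exclude flagged orders so KPIs aren't inflated by load tests.
    cursor.execute(
        "SELECT COALESCE(SUM(total), 0) FROM orders "
        "WHERE CAST(created_at AS DATE) = ? AND is_test_order = 0",
        day,
    )
    return cursor.fetchone()[0]
```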

The good

These tests showed that the system could handle at least 50% more than the actual peak load. And if any scalability issues were identified, the team had time to react.

The tests could be quickly shut down at the first sign of trouble.

Since everything was running on EC2, the load tests also ensured the system was scaled out to handle much more load than anticipated. If there was an unexpected increase in real traffic, the load tests could be shut down to make room for it.

The team put a lot of care and thought into the execution of the load tests, ensuring they were carried out safely and could be stopped at a moment's notice.

Overall, I thought the load tests were carried out methodically and it was a refreshing approach.

The bad

However, nothing was done to clean up the fake orders from the system. I flagged this as a potential problem because the database had accumulated an enormous number of fake orders.

And it wasn't just the orders.

As the fake orders traversed through the system, they left behind many other data trails in various database tables.

Another important factor to consider is that this was, in effect, a distributed monolith. They had started down the path towards microservices a few years earlier, but there was still a shared, self-hosted SQL Server database running on EC2.

Many of the microservices had their own databases, but data changes were synced to this shared database because other services still relied on it.

What I didn't know at the time was that they had maxed out the number of EBS volumes they could attach. They had the biggest, fastest EBS volumes money could buy, and they were completely maxed out.

The database was so big they literally reached the vertical scaling limit.

I later learnt that something like 40% of the data volume was attributable to fake orders.

The ugly

The sheer volume of fake orders in the system added stress to the shared SQL Server database. But at least it wasn't the source of truth for many of their critical services, so it wasn't a single point of failure.

Or so I thought.

Two years after I left the company, they had their biggest outage ever, and it lasted several days.

The SQLServer database blew up (figuratively).

I later learnt that the database was also used by an internal customer support tool. Someone ran a query so expensive that it ground the database to a halt.

Remember, this was a gigantic database. The size of the database was a big contributing factor.

And it turned out that many of the microservices with their own databases still needed data from the SQL Server database to operate. So when the database was down, everything failed.

The aftermath

A lot of changes were instigated after the incident. There were significant changes to the technical leadership, and they finally addressed much of their long-standing technical debt.

The fake orders were removed and they stopped the practice of running load tests in production (AFAIK).

They finally implemented caching in some parts of the system, such as the menu service. Amazingly, they had never cached menu data because the service team was convinced it was more performant and scalable to load the data from the database every time...
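
I don't know the details of the fix they eventually shipped, but even a simple in-process cache with a short TTL takes most of the read load off the database for data that rarely changes, like menus. A minimal sketch, with a hypothetical fetch_menu_from_db loader:

```python
import time
from typing import Any, Callable

class TtlCache:
    """A tiny in-process cache with a fixed time-to-live per key."""

    def __init__(self, ttl_seconds: float, loader: Callable[[str], Any]):
        self.ttl = ttl_seconds
        self.loader = loader           # e.g. a function that queries the database
        self._entries: dict[str, tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = time.monotonic()
        cached = self._entries.get(key)
        if cached and now - cached[0] < self.ttl:
            return cached[1]           # fresh enough, skip the database entirely
        value = self.loader(key)       # cache miss (or stale) -> hit the database once
        self._entries[key] = (now, value)
        return value

# Hypothetical usage: menus change rarely, so even a 60-second TTL removes
# almost all menu reads from the shared database during peak hours.
# menu_cache = TtlCache(ttl_seconds=60, loader=fetch_menu_from_db)
# menu = menu_cache.get("restaurant-1234")
```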

All in all, they had an amazing turnaround in the engineering department.

The funny thing was, during their outage, all their customers went to their closest competitor and took them offline... so in the end, they didn't suffer a huge business loss.

What can we learn from this?

System isolation matters

In a distributed architecture, every service should have its own database. Sharing a database creates a single point of failure and multiple forms of coupling [1] between systems - temporal, topology and format.

Uptime cost is a sunk cost

When it comes to relational databases such as RDS or self-hosted SQL Server, there's a sunk cost fallacy at play.

Because you're already paying for the uptime of a beefy database server, it seems more cost-efficient to reuse it. This economic force pushes teams towards shared databases.

And since you're sharing a database already, it's just easier to read another service's data directly instead of going through its API.

Before you know it, you have a distributed monolith with so much implicit coupling between services. A single schema change can break multiple, unrelated services. And one expensive query can kill the whole system.
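
To make that coupling concrete, here's a hypothetical contrast (the service names, table and endpoint are invented for illustration): the convenient path reads another service's table directly and silently depends on its schema, while the API path only depends on a published contract.

```python
import json
import urllib.request

import pyodbc

def get_customer_address_via_shared_db(cursor: pyodbc.Cursor, customer_id: int) -> str:
    # The convenient path: reach straight into the customer service's table.
    # This silently breaks if the customer team renames a column, splits the
    # table, or migrates off the shared database - and they have no way of
    # knowing this caller exists.
    cursor.execute(
        "SELECT address_line_1 FROM customer.customers WHERE id = ?", customer_id
    )
    return cursor.fetchone()[0]

def get_customer_address_via_api(customer_id: int) -> str:
    # The decoupled path: depend only on the customer service's published API.
    # The schema behind the API can change freely without breaking this caller.
    url = f"https://customer-service.internal/customers/{customer_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["address"]["line1"]
```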

Don't overlook internal tools

So many engineering teams have a blind spot when it comes to their internal tools. There have been many examples of internal tools causing global outages (e.g. by pushing out bad server configs) or security breaches.

When you assess the performance and scalability of your system, pay close attention to these internal tools.
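
One cheap guardrail, sketched here as an assumption rather than anything the company actually did: give internal tools their own database connections with an aggressive query timeout, so a runaway ad-hoc query gets cancelled before it can starve production traffic. For example, with pyodbc against SQL Server:

```python
import pyodbc

# Hypothetical connection string for the shared SQL Server instance.
CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=shared-db.internal;DATABASE=orders;Trusted_Connection=yes;"
)

def connect_for_internal_tool(query_timeout_seconds: int = 5) -> pyodbc.Connection:
    # Internal tools get a short query timeout. pyodbc raises OperationalError
    # when a query exceeds Connection.timeout, instead of letting it run forever.
    cnxn = pyodbc.connect(CONN_STR)
    cnxn.timeout = query_timeout_seconds
    return cnxn

# Usage: an analyst's runaway query is cancelled after 5 seconds rather than
# grinding the shared database (and every service depending on it) to a halt.
# cnxn = connect_for_internal_tool()
# rows = cnxn.cursor().execute("SELECT COUNT(*) FROM orders").fetchall()
```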

Never let a good crisis go to waste

As bad as the outage was, they were able to use it as a catalyst to clean up their act. They cleaned up years of technical debt and rethought key architectural decisions.

Something similar also happened at DAZN, where it took a good outage to muster the commitment and drive for a widespread adoption of consumer-driven contract testing.

Nobody likes an outage. But they can be a useful catalyst for good things to follow.

Build for success

Making it in a competitive market is tough. Sometimes your only chance for success is if your competitor slips up.

In this case, my former employer's competitor missed their shot and were later acquired (by my former employer). There's no room for a distant second in the food delivery business.

This reminds me of a hard lesson I had to learn earlier in my career: you have to build for success.

Your system needs to be cost-efficient at low scale and able to scale up quickly when it needs to. Because success can come from a single tweet by a celebrity endorser, an article, or a competitor having a bad day.

You need to be ready for it!

In many ways, this is exactly why I'm so enamoured with serverless technologies like Lambda.

You can build systems that cost pennies to run [2] when you are starting out. But when success comes, the system can instantly scale to thousands of requests per second [3] without breaking a sweat!

Yes, the cost will go up in the short term, but so will your revenue. At least now you have a chance of sustaining that initial taste of success and hopefully seeing years of hard work pay off.

Links

[1] Many facets of coupling

[2] He built a hotel booking system that costs $0.82/month to run

[3] Lambda is more scalable than you think
