Master Serverless

How to handle execution timeouts in AWS Step Functions

Published 17 days ago • 2 min read

Step Functions lets you set a timeout on Task states and the whole execution.

By default, a Task state times out after 60 seconds. But a Standard Workflow execution can run for up to a year if no TimeoutSeconds is configured. To a user, the execution would appear as “stuck”.

AWS best practices recommend using timeouts to avoid such scenarios [1]. So it’s important to consider what happens when you experience a timeout.

You can use the Catch clause to handle the States.Timeout error when a Task state times out. You can then perform automated remediation steps.
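As a rough sketch of the two kinds of timeout, the snippet below builds a state machine definition as a Python dictionary. It sets an execution-level TimeoutSeconds at the top and a Task-level TimeoutSeconds with a Catch for States.Timeout. The state names (DoWork, Remediate), the Lambda ARN and the timeout values are made up for illustration.

```python
import json


def build_definition(task_arn: str, task_timeout: int, execution_timeout: int) -> dict:
    """Build an Amazon States Language definition with both kinds of timeout."""
    return {
        "StartAt": "DoWork",
        # Execution-level timeout: applies to the whole state machine.
        "TimeoutSeconds": execution_timeout,
        "States": {
            "DoWork": {
                "Type": "Task",
                "Resource": task_arn,
                # Task-level timeout: the state fails with States.Timeout if exceeded.
                "TimeoutSeconds": task_timeout,
                "Catch": [
                    {
                        "ErrorEquals": ["States.Timeout"],
                        # Keep the original input and attach the error next to it.
                        "ResultPath": "$.error",
                        "Next": "Remediate",
                    }
                ],
                "End": True,
            },
            # Placeholder for the automated remediation steps (hypothetical).
            "Remediate": {"Type": "Pass", "End": True},
        },
    }


definition = build_definition(
    task_arn="arn:aws:lambda:us-east-1:123456789012:function:do-work",  # hypothetical ARN
    task_timeout=30,
    execution_timeout=300,
)
print(json.dumps(definition, indent=2))
```

Note that the Catch only covers the Task-level timeout. Nothing inside the definition can catch the execution-level one, which is the problem the rest of this post deals with.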

But what happens when the whole execution times out? How can we catch and handle execution timeouts like we do with Task states?

Here are 3 ways to do it.

EventBridge

Standard Workflows publish TIMED_OUT events to the default EventBridge bus. We can create an EventBridge rule to match against these events. That way, we can trigger a Lambda function to handle the error.

The event contains the state machine ARN, execution name, input and output. We can even use the execution ARN to fetch the full audit history of the execution.

That should give us everything we need to figure out what happened.
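Here is a minimal sketch of that setup, assuming the standard "Step Functions Execution Status Change" event shape. The pattern is what we would put on the EventBridge rule, and the handler is a hypothetical Lambda function that pulls out the fields mentioned above. What you do with the summary (alerting, retries, clean-up) is up to you.

```python
import json

# Event pattern for the EventBridge rule on the default bus.
TIMED_OUT_PATTERN = {
    "source": ["aws.states"],
    "detail-type": ["Step Functions Execution Status Change"],
    "detail": {"status": ["TIMED_OUT"]},
}


def _parse_json(value):
    """Input and output arrive as JSON strings and may be missing or null."""
    return json.loads(value) if value else None


def handler(event, context=None):
    """Hypothetical Lambda handler for execution timeout events."""
    detail = event["detail"]
    summary = {
        "stateMachineArn": detail["stateMachineArn"],
        "executionArn": detail["executionArn"],
        "executionName": detail["name"],
        "input": _parse_json(detail.get("input")),
        "output": _parse_json(detail.get("output")),
    }
    # Placeholder: replace with your own remediation or alerting logic.
    print(json.dumps(summary))
    return summary
```

The executionArn in the summary is what you would pass to GetExecutionHistory if you need the full audit history.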

Unfortunately, this approach only works for Standard Workflows. Express Workflows do not emit events to EventBridge.

CloudWatch Logs

Both Standard and Express Workflows can write logs to CloudWatch. When an execution times out, it writes a log event that includes the execution ARN.

We can use CloudWatch log subscription to send these events to a Lambda function to handle the timeout.
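A subscription delivers log events to Lambda as a base64-encoded, gzip-compressed JSON payload under awslogs.data. The sketch below decodes that payload and keeps only the timeout events. It assumes each log message is a JSON document with a "type" of "ExecutionTimedOut" and an "execution_arn" field, so check those names against your own log group before relying on them.

```python
import base64
import gzip
import json


def decode_subscription_event(event: dict) -> list:
    """Return the parsed log messages from a CloudWatch Logs subscription event."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    payload = json.loads(gzip.decompress(compressed))
    return [json.loads(log_event["message"]) for log_event in payload["logEvents"]]


def handler(event, context=None):
    """Hypothetical Lambda handler that picks out execution timeouts."""
    timeouts = [
        message
        for message in decode_subscription_event(event)
        # Assumed field names: "type" and "execution_arn".
        if message.get("type") == "ExecutionTimedOut"
    ]
    for message in timeouts:
        # Placeholder: replace with your own remediation or alerting logic.
        print("execution timed out:", message.get("execution_arn"))
    return timeouts
```

A subscription filter pattern on the log group can do the same filtering before the function is invoked, which saves on invocations.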

However, these log events are not as easy to use as the EventBridge events.

We can extract the state machine name and execution name from the execution ARN. But not the input and output.
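Extracting the names is a matter of splitting the ARN on colons. This sketch assumes the usual layout, where the sixth segment is "execution" for Standard Workflows or "express" for Express Workflows, followed by the state machine name and the execution name. The function name is my own.

```python
def parse_execution_arn(arn: str) -> dict:
    """Split an execution ARN into workflow kind, state machine name and execution name."""
    parts = arn.split(":")
    if len(parts) < 8 or parts[2] != "states" or parts[5] not in ("execution", "express"):
        raise ValueError(f"not a Step Functions execution ARN: {arn}")
    return {
        "kind": parts[5],  # "execution" (Standard) or "express"
        "state_machine": parts[6],
        "execution": parts[7],  # Express ARNs carry an extra trailing segment we ignore
    }


print(parse_execution_arn("arn:aws:states:us-east-1:123456789012:execution:orders:run-1"))
```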

For Standard Workflows, we can use the GetExecutionHistory [2] API to fetch the execution history. But this does not support Express Workflows. Instead, we must rely on the audit history logged to CloudWatch.
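For the Standard Workflow case, a sketch of fetching the full history might look like the following. It takes the client as a parameter (in practice something like boto3.client("stepfunctions")) and follows nextToken until all pages are read. The function name is my own.

```python
def fetch_execution_history(sfn_client, execution_arn: str) -> list:
    """Collect every history event for a Standard Workflow execution."""
    events = []
    kwargs = {"executionArn": execution_arn, "includeExecutionData": True}
    while True:
        page = sfn_client.get_execution_history(**kwargs)
        events.extend(page["events"])
        token = page.get("nextToken")
        if not token:
            return events
        # More pages remain: ask for the next one.
        kwargs["nextToken"] = token
```

Passing the client in keeps the function easy to test with a stub, and avoids creating a new client on every call.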

These are not always available, because we will likely set the log level to ERROR to minimize the cost of CloudWatch Logs.

This approach can work for both Standard and Express Workflows. However, it might not be practical because the log event provides limited information about the execution.

Nested workflows

We can solve the problems above by nesting our state machine inside a parent Standard Workflow.

✅ Works for both Standard and Express Workflows.

✅ We have the input and output for the execution.

This is a simple and elegant solution. It’s definitely my favourite approach for handling execution timeouts.
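To make the idea concrete, here is one way the parent definition could look, written as a Python dictionary. This is a sketch under assumptions: the parent runs the child through the states:startExecution.sync:2 service integration, enforces the timeout on its own Task state, and catches States.Timeout there. The state names and the child ARN are hypothetical, and you should check which integration suits your child workflow type.

```python
import json


def build_parent_definition(child_state_machine_arn: str, timeout_seconds: int) -> dict:
    """Build a parent Standard Workflow that runs a child and handles its timeout."""
    return {
        "StartAt": "RunChild",
        "States": {
            "RunChild": {
                "Type": "Task",
                # Start the child execution and wait for it to finish.
                "Resource": "arn:aws:states:::states:startExecution.sync:2",
                "Parameters": {
                    "StateMachineArn": child_state_machine_arn,
                    # Pass the parent's input through to the child.
                    "Input.$": "$",
                },
                # The timeout now lives on a Task state, so it can be caught.
                "TimeoutSeconds": timeout_seconds,
                "Catch": [
                    {
                        "ErrorEquals": ["States.Timeout"],
                        # Keep the original input and attach the error next to it.
                        "ResultPath": "$.error",
                        "Next": "HandleTimeout",
                    }
                ],
                "End": True,
            },
            # Placeholder for the remediation steps (hypothetical).
            "HandleTimeout": {"Type": "Pass", "End": True},
        },
    }


parent = build_parent_definition(
    "arn:aws:states:us-east-1:123456789012:stateMachine:child",  # hypothetical ARN
    timeout_seconds=300,
)
print(json.dumps(parent, indent=2))
```

Because the Catch uses ResultPath, the HandleTimeout state receives the original input alongside the error details.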

Honourable mentions

There are other variants besides the approaches we discussed here. You can even turn this problem into an ad-hoc scheduling problem.

For example, you can send a message to SQS with a delivery delay matching the state machine timeout. Or create a schedule in EventBridge Scheduler to be executed when the state machine would have timed out.
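As a sketch of the scheduling side, the helpers below work out when the check should fire. One builds a one-time at(...) expression for EventBridge Scheduler (in UTC, which is what Scheduler assumes when no timezone is given). The other validates the delay for SQS, where the delivery delay is capped at 15 minutes. The function names are my own.

```python
from datetime import datetime, timedelta, timezone

# SQS delivery delay cannot exceed 15 minutes.
SQS_MAX_DELAY_SECONDS = 900


def scheduler_expression(started_at: datetime, timeout_seconds: int) -> str:
    """One-time schedule expression for when the execution would have timed out."""
    fire_at = started_at.astimezone(timezone.utc) + timedelta(seconds=timeout_seconds)
    return f"at({fire_at.strftime('%Y-%m-%dT%H:%M:%S')})"


def sqs_delay_seconds(timeout_seconds: int) -> int:
    """Return the delay to use for SQS, or raise if the timeout is too long for it."""
    if not 0 <= timeout_seconds <= SQS_MAX_DELAY_SECONDS:
        raise ValueError(
            f"timeout of {timeout_seconds}s is outside the SQS delay range; "
            "use EventBridge Scheduler instead"
        )
    return timeout_seconds


started = datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc)
print(scheduler_expression(started, 300))
print(sqs_delay_seconds(300))
```

The 15-minute cap means the SQS variant only works for short-lived state machines. For anything longer, the Scheduler variant is the one to reach for.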

In both cases, you run into the limitation that Step Functions’ DescribeExecution and ListExecutions APIs don’t support Express Workflows.

This makes it difficult to find out if an execution timed out in the end. It’s only possible to do this by querying CloudWatch Logs. I don’t think the extra complexity and cost are worth it. So, I’d recommend using one of the three proposed solutions here instead.

Links

[1] Use timeouts to avoid stuck executions

[2] Step Functions GetExecutionHistory API

by Yan Cui, AWS Serverless Hero
