How to handle execution timeouts in AWS Step Functions


Step Functions lets you set a timeout on Task states and the whole execution.

By default, a Task state times out after 60 seconds. But an execution can run for a year if no TimeoutSeconds is configured. To a user, the execution would appear as “stuck”.

AWS best practices recommend using timeouts to avoid such scenarios [1]. So it’s important to consider what happens when you experience a timeout

You can use the Catch clause to handle the States.Timeout error when a Task state times out. You can then perform automated remediation steps.

But what happens when the whole execution times out? How can we catch and handle execution timeouts like we do with Task states?

Here are 3 ways to do it.

EventBridge

Standard Workflows publish TIMED_OUT events to the default EventBridge bus. We can create an EventBridge rule to match against these events. That way, we can trigger a Lambda function to handle the error.

The event contains the state machine ARN, execution name, input and output. We can even use the execution ARN to fetch the full audit history of the execution.

That should give us everything we need to figure out what happened.

Unfortunately, this approach only works for Standard Workflows. Express Workflows do not emit events to EventBridge.

CloudWatch Logs

Both Standard and Express Workflows can write logs to CloudWatch. When an execution times out, it writes a log event like this:

We can use CloudWatch log subscription to send these events to a Lambda function to handle the timeout.

However, these log events are not as easy to use as the EventBridge events.

We can extract the state machine name and execution name from the execution ARN. But not the input and output.

For Standard Workflows, we can use the GetExecutionHistory [2] API to fetch the execution history. But this does not support Express Workflows. Instead, we must rely on the audit history logged to CloudWatch.

These are not always available. Because we will likely set the log level to ERROR to minimize the cost of CloudWatch Logs.

This approach can work for both Standard and Express Workflows. However, it might not be practical because the log event provides limited information about the execution.

Nested workflows

We can solve the abovementioned problems by nesting our state machine inside a parent Standard Workflow.

✅ Works for both Standard and Express Workflows.

✅ We have the input and output for the execution.

This is a simple and elegant solution. It’s definitely my favourite approach for handling execution timeouts.

Honourable mentions

There are other variants besides the approaches we discussed here. You can even turn this problem into an ad-hoc scheduling problem.

For example, you can send a message to SQS with a delivery delay matching the state machine timeout. Or create a schedule in EventBridge Scheduler to be executed when the state machine would have timed out.

In both cases, you run into the limitation that Step Functions’ DescribeExecution and ListExecutions APIs don’t support Express Workflows.

This makes it difficult to find out if an execution timed out in the end. It’s only possible to do this by querying CloudWatch Logs. I don’t think the extra complexity and cost are worth it. So, I’d recommend using one of the three proposed solutions here instead.

Links

[1] Use timeouts to avoid stuck executions

[2] Step Function’s GetExecutionHistory API

Master Serverless

Join 17K readers and level up you AWS game with just 5 mins a week.

Read more from Master Serverless

Lambda Durable Functions makes it easy to implement business workflows using plain Lambda functions. Besides the intended use cases, they also let us implement ETL jobs without needing recursions or Step Functions. Many long-running ETL jobs have a time-consuming, sequential steps that cannot be easily parallelised. For example: Fetching data from shared databases/APIs with throughput limits. When data needs to be processed sequentially. Historically, Lambda was not a good fit for these...

Step Functions is often used to poll long-running processes, e.g. when starting a new data migration task with Amazon Database Migration. There's usually a Wait -> Poll -> Choice loop that runs until the task is complete (or failed), like the one below. Polling is inefficient and can add unnecessary cost as standard workflows are charged based on the number of state transitions. There is an event-driven alternative to this approach. Here's the high level approach: To start the data migration,...

Lambda Durable Functions comes with a handy testing SDK. It makes it easy to test durable executions both locally as well as remotely in the cloud. I find the local test runner particular useful for dealing with wait states because I can simply configure the runner to skip time! However, this does not work for callback operations such as waitForCallback. Unfortunately, the official docs didn't include any examples on how to handle this. So here's my workaround. The handler code Imagine you're...