Step Functions lets you set a timeout on By default, a AWS best practices recommend using timeouts to avoid such scenarios [1].

So it’s important to consider what happens when you experience a timeout You can use the But what happens when the whole execution times out? How can we catch and handle execution timeouts like we do with

Here are 3 ways to do it.

EventBridge

Standard Workflows publish The event contains the state machine ARN, execution name, input and output. We can even use the execution ARN to fetch the full audit history of the execution. That should give us everything we need to figure out what happened.

Unfortunately, this approach only works for Standard Workflows. Express Workflows do not emit events to EventBridge.

CloudWatch Logs

Both Standard and Express Workflows can write logs to CloudWatch. When an execution times out, it writes a log event like this:

We can use CloudWatch log subscription to send these events to a Lambda function to handle the timeout. However, these log events are not as easy to use as the EventBridge events. We can extract the state machine name and execution name from the execution ARN. But not the input and output.

For Standard Workflows, we can use the GetExecutionHistory [2] API to fetch the execution history. But this does not support Express Workflows. Instead, we must rely on the audit history logged to CloudWatch. These are not always available. Because we will likely set the log level to

This approach can work for both Standard and Express Workflows. However, it might not be practical because the log event provides limited information about the execution.

Nested workflows

We can solve the abovementioned problems by nesting our state machine inside a parent Standard Workflow.

✅ Works for both Standard and Express Workflows.
✅ We have the input and output for the execution.

This is a simple and elegant solution. It’s definitely my favourite approach for handling execution timeouts.
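To make the nested approach concrete, here is a minimal sketch of what a parent state machine definition could look like, built as a plain Python dictionary. It is only an illustration under a few assumptions of mine: the parent starts the child through the `startExecution.sync:2` service integration, the timeout is enforced by `TimeoutSeconds` on the parent's Task state, and the `States.Timeout` error is routed to a Lambda function. The state names (`RunChild`, `HandleTimeout`), the `timeoutError` result path and the function `build_parent_definition` are hypothetical, and you should check that the integration pattern you pick supports your child workflow type.

```python
import json


def build_parent_definition(child_arn: str, timeout_seconds: int, handler_arn: str) -> dict:
    """Build an Amazon States Language definition for a parent Standard Workflow.

    Illustrative only: state names and the "timeoutError" path are hypothetical.
    """
    return {
        "StartAt": "RunChild",
        "States": {
            "RunChild": {
                "Type": "Task",
                # Start the child execution and wait for it to finish.
                "Resource": "arn:aws:states:::states:startExecution.sync:2",
                "Parameters": {
                    "StateMachineArn": child_arn,
                    # Pass the parent's input straight through to the child.
                    "Input.$": "$",
                },
                # The parent enforces the timeout on the child execution.
                "TimeoutSeconds": timeout_seconds,
                "Catch": [
                    {
                        "ErrorEquals": ["States.Timeout"],
                        # Keep the original input and attach the error details.
                        "ResultPath": "$.timeoutError",
                        "Next": "HandleTimeout",
                    }
                ],
                "End": True,
            },
            "HandleTimeout": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                "Parameters": {
                    "FunctionName": handler_arn,
                    # The handler receives the input plus the timeout error.
                    "Payload.$": "$",
                },
                "End": True,
            },
        },
    }


# Placeholder ARNs, for illustration only.
definition = build_parent_definition(
    child_arn="arn:aws:states:us-east-1:123456789012:stateMachine:child",
    timeout_seconds=300,
    handler_arn="arn:aws:lambda:us-east-1:123456789012:function:handle-timeout",
)
print(json.dumps(definition, indent=2))
```

Because the `Catch` clause uses a `ResultPath`, the handler sees the original input alongside the error, which is what makes this approach more informative than a bare log event.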
Honourable mentions

There are other variants besides the approaches we discussed here. You can even turn this problem into an ad-hoc scheduling problem.

For example, you can send a message to SQS with a delivery delay matching the state machine timeout. Or create a schedule in EventBridge Scheduler to be executed when the state machine would have timed out.

In both cases, you run into the limitation that Step Functions’ This makes it difficult to find out if an execution timed out in the end. It’s only possible to do this by querying CloudWatch Logs.

I don’t think the extra complexity and cost are worth it. So, I’d recommend using one of the three proposed solutions here instead.
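If you go with the EventBridge approach for a Standard Workflow, the rule and the handler could be sketched roughly as below. This is a rough sketch rather than a definitive implementation: the event pattern matches the "Step Functions Execution Status Change" events with a `TIMED_OUT` status, and the helper names `build_timeout_event_pattern` and `summarise_timeout_event` are hypothetical. It also assumes the `input` and `output` fields arrive as JSON strings and may be absent.

```python
import json
from typing import Any, Optional


def build_timeout_event_pattern(state_machine_arn: str) -> dict:
    """EventBridge event pattern matching timed-out executions of one state machine."""
    return {
        "source": ["aws.states"],
        "detail-type": ["Step Functions Execution Status Change"],
        "detail": {
            "status": ["TIMED_OUT"],
            "stateMachineArn": [state_machine_arn],
        },
    }


def _parse_json_field(value: Optional[str]) -> Any:
    # Assumption: input/output are JSON strings and may be missing or null.
    return json.loads(value) if value else None


def summarise_timeout_event(event: dict) -> dict:
    """Pull out the fields a timeout handler is likely to need (hypothetical helper)."""
    detail = event["detail"]
    return {
        "state_machine_arn": detail["stateMachineArn"],
        "execution_arn": detail["executionArn"],
        "execution_name": detail["name"],
        "input": _parse_json_field(detail.get("input")),
        "output": _parse_json_field(detail.get("output")),
    }
```

The execution ARN in the summary is what you would pass to GetExecutionHistory if you want the full audit history.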