The anti-polling pattern for Step Functions


Step Functions is often used to poll long-running processes, e.g. after starting a new data migration task with AWS Database Migration Service (DMS).

There's usually a Wait -> Poll -> Choice loop that runs until the task completes (or fails), like the one below.
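
Here's a hedged sketch of that loop as a CDK state machine definition (TypeScript). The checkMigrationStatusFn Lambda, its { status } response shape and the 30-second wait are assumptions for illustration; adapt them to your own migration task.

```typescript
// Minimal sketch of the Wait -> Poll -> Choice loop in CDK (TypeScript).
// "checkMigrationStatusFn" is a hypothetical Lambda that returns
// { status: "RUNNING" | "COMPLETED" | "FAILED" }.
import { Duration } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";

declare const scope: Construct;                          // your stack or construct
declare const checkMigrationStatusFn: lambda.IFunction;  // hypothetical status-check Lambda

// Wait: sleep between polls
const wait = new sfn.Wait(scope, "Wait30Seconds", {
  time: sfn.WaitTime.duration(Duration.seconds(30)),
});

// Poll: ask a Lambda function for the current status of the migration task
const poll = new tasks.LambdaInvoke(scope, "CheckMigrationStatus", {
  lambdaFunction: checkMigrationStatusFn,
  outputPath: "$.Payload",
});

// Choice: finish, fail, or go around the loop again
const decide = new sfn.Choice(scope, "IsMigrationDone")
  .when(sfn.Condition.stringEquals("$.status", "COMPLETED"), new sfn.Succeed(scope, "Done"))
  .when(sfn.Condition.stringEquals("$.status", "FAILED"), new sfn.Fail(scope, "Failed"))
  .otherwise(wait); // still running -> wait and poll again

const definition = wait.next(poll).next(decide);

new sfn.StateMachine(scope, "PollingStateMachine", {
  definitionBody: sfn.DefinitionBody.fromChainable(definition),
});
```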

Polling is inefficient and can add unnecessary cost as standard workflows are charged based on the number of state transitions.
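
As a rough illustration, assuming the us-east-1 list price of $0.025 per 1,000 state transitions: each loop iteration goes through 3 states, so polling every 30 seconds during a 4-hour migration burns roughly 480 × 3 = 1,440 transitions (about $0.036) per execution, versus a handful of transitions when the state machine simply waits for a callback. The per-execution cost is small, but it multiplies across executions and clutters the execution history.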

There is an event-driven alternative to this approach.

Here's the high-level approach (a code sketch of the two Lambda functions follows the list):

  1. To start the data migration, the state machine invokes a Lambda function and passes it a task token (the "wait for task token" integration pattern). This pauses the state machine execution until the token is returned.
  2. The Lambda function calls the Database Migration service to start the data migration.
  3. The function saves the data migration ARN (hash key) and the task token in DynamoDB, along with other relevant information (created date, etc.).
  4. The Database Migration Service publishes "StateChange" events to the default EventBridge event bus (see docs here). A Lambda function subscribes to these events and waits for a replication task to finish or fail.
  5. When triggered, the function extracts the data migration ARN from the event payload and retrieves the Step Functions task token from DynamoDB.
  6. It can use the task token to send a success or failure signal back to the state machine execution. From here, the state machine can proceed with the rest of its steps.

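Here's a hedged sketch of the two Lambda functions from the list above (TypeScript, AWS SDK v3). The table name, key attributes and the exact fields in the DMS event detail are assumptions for illustration; check them against your own setup and the DMS event documentation.

```typescript
// Hedged sketch of the two functions from the list above (AWS SDK v3).
// Table name, key names and the DMS event detail fields are assumptions.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, GetCommand } from "@aws-sdk/lib-dynamodb";
import {
  DatabaseMigrationServiceClient,
  StartReplicationTaskCommand,
} from "@aws-sdk/client-database-migration-service";
import { SFNClient, SendTaskSuccessCommand, SendTaskFailureCommand } from "@aws-sdk/client-sfn";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const dms = new DatabaseMigrationServiceClient({});
const sfn = new SFNClient({});
const TABLE_NAME = process.env.TABLE_NAME!; // table keyed by ReplicationTaskArn (hash key)

// Steps 1-3: called by the state machine with a task token; starts the migration
// and stores the token so the second function can find it later.
export const startMigration = async (event: { taskToken: string; replicationTaskArn: string }) => {
  await dms.send(new StartReplicationTaskCommand({
    ReplicationTaskArn: event.replicationTaskArn,
    StartReplicationTaskType: "start-replication",
  }));

  await ddb.send(new PutCommand({
    TableName: TABLE_NAME,
    Item: {
      ReplicationTaskArn: event.replicationTaskArn,
      TaskToken: event.taskToken,
      CreatedAt: new Date().toISOString(),
    },
  }));
};

// Steps 4-6: triggered by the DMS state-change event on the default EventBridge bus.
export const onStateChange = async (event: { resources?: string[]; detail?: Record<string, string> }) => {
  const arn = event.resources?.[0]; // the replication task ARN
  if (!arn) return;

  const { Item } = await ddb.send(new GetCommand({
    TableName: TABLE_NAME,
    Key: { ReplicationTaskArn: arn },
  }));
  if (!Item) return;

  // The exact field and values in the event detail are assumptions -- inspect a real
  // DMS "Replication Task State Change" event in your account before relying on them.
  const status = (event.detail?.eventType ?? "").toLowerCase();

  if (status.includes("fail")) {
    await sfn.send(new SendTaskFailureCommand({
      taskToken: Item.TaskToken,
      error: "ReplicationTaskFailed",
      cause: JSON.stringify(event.detail),
    }));
  } else if (status.includes("stop") || status.includes("complete")) {
    await sfn.send(new SendTaskSuccessCommand({
      taskToken: Item.TaskToken,
      output: JSON.stringify({ replicationTaskArn: arn }),
    }));
  }
};
```
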
This approach is more efficient but also more complex. There are more moving parts, but each of them is simple to implement.

But what if you're calling a 3rd party API that doesn't support events?

You can adapt this approach to work with any service that accepts a callback URL. When the 3rd party service makes the callback, your API handler looks up the task token and sends the callback to Step Functions. Everything else stays the same as before.
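
Here's a hedged sketch of what that API handler could look like (API Gateway + Lambda). The route shape, correlation ID and table layout are assumptions for illustration.

```typescript
// Hedged sketch of the callback-URL variant's API handler.
// Assumes the callback URL included a correlation ID (e.g. /callbacks/{id}) that was
// stored in DynamoDB alongside the task token when the job was started.
import { APIGatewayProxyHandler } from "aws-lambda";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";
import { SFNClient, SendTaskSuccessCommand } from "@aws-sdk/client-sfn";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const sfn = new SFNClient({});

export const handler: APIGatewayProxyHandler = async (event) => {
  const correlationId = event.pathParameters?.id;

  const { Item } = await ddb.send(new GetCommand({
    TableName: process.env.TABLE_NAME!,
    Key: { CorrelationId: correlationId },
  }));
  if (!Item) return { statusCode: 404, body: "unknown callback" };

  // Hand the 3rd party's result back to the paused state machine execution
  await sfn.send(new SendTaskSuccessCommand({
    taskToken: Item.TaskToken,
    output: event.body ?? "{}",
  }));

  return { statusCode: 200, body: "ok" };
};
```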

What's more, both the polling and the event-driven approaches can be implemented with the new Lambda Durable Functions too!

The waitForCondition operation is perfect for implementing the polling loop in just a few lines, like this:
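
(The sketch below is indicative only: the durable() wrapper and context shape are declared as placeholder assumptions; only the waitForCondition operation itself comes from the Durable Functions docs, so check them for the exact API.)

```typescript
// Indicative sketch only -- durable() and DurableContext are placeholders; only the
// waitForCondition operation is taken from the feature's docs.
import {
  DatabaseMigrationServiceClient,
  DescribeReplicationTasksCommand,
} from "@aws-sdk/client-database-migration-service";

interface DurableContext {
  // Re-evaluates the check on a schedule and suspends the execution in between.
  waitForCondition(check: () => Promise<boolean>, opts?: { interval?: string }): Promise<void>;
}
declare function durable<TEvent>(
  fn: (event: TEvent, ctx: DurableContext) => Promise<void>
): (event: TEvent) => Promise<void>;

const dms = new DatabaseMigrationServiceClient({});

export const handler = durable(async (event: { replicationTaskArn: string }, ctx) => {
  // Poll DMS until the replication task reaches a terminal state; the execution is
  // checkpointed and suspended between checks, so nothing spins while we wait.
  await ctx.waitForCondition(async () => {
    const { ReplicationTasks } = await dms.send(new DescribeReplicationTasksCommand({
      Filters: [{ Name: "replication-task-arn", Values: [event.replicationTaskArn] }],
    }));
    const status = ReplicationTasks?.[0]?.Status;
    return status === "stopped" || status === "failed";
  }, { interval: "30 seconds" }); // option name is an assumption

  // ...carry on with the rest of the workflow here
});
```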

Similarly, the waitForCallback operation makes implementing the event-driven approach trivial. Instead of a task token, we have to store a callback ID. As before, the callback can be triggered by an event or by a 3rd party service via a callback URL.
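
Again, a hedged sketch only: the durable() wrapper and the waitForCallback signature below are placeholder assumptions; only the operation name and the idea of storing a callback ID come from the docs.

```typescript
// Placeholder declarations -- only waitForCallback and the callback ID concept are from
// the docs; the shapes below are assumptions for illustration.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

interface DurableCallbackContext {
  waitForCallback<T>(register: (callbackId: string) => Promise<void>): Promise<T>;
}
declare function durable<TEvent>(
  fn: (event: TEvent, ctx: DurableCallbackContext) => Promise<void>
): (event: TEvent) => Promise<void>;

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export const handler = durable(async (event: { replicationTaskArn: string }, ctx) => {
  // Store the callback ID (instead of a task token) so the event handler or the
  // 3rd party callback endpoint can complete it later.
  const result = await ctx.waitForCallback<Record<string, unknown>>(async (callbackId) => {
    await ddb.send(new PutCommand({
      TableName: process.env.TABLE_NAME!,
      Item: { ReplicationTaskArn: event.replicationTaskArn, CallbackId: callbackId },
    }));
  });

  console.log("migration finished", result); // ...continue with the rest of the workflow
});
```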

Polling is the default because it’s easy, not because it’s good.

If you can get an event (or a callback), you can stop spinning your state machine and start treating "waiting" as a first-class step. Less noise, fewer transitions, lower cost.

Thank you to Patrick for bringing this up in our last Q&A session. If you want to level up your AWS game, check out my Production-Ready Serverless workshop. The next cohort starts on April 13th, and the early bird tickets (30% off) are available until March 16th.

Until then, ciao!
