The anti-polling pattern for Step Functions


Step Functions is often used to poll long-running processes, e.g. after starting a new data migration task with AWS Database Migration Service (DMS).

There's usually a Wait -> Poll -> Choice loop that runs until the task completes (or fails), like the one below.
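
Here's a hedged sketch of that loop as a CDK state machine definition (TypeScript). The checkMigrationStatusFn Lambda, its { status } response shape and the 30-second wait are assumptions for illustration; adapt them to your own migration task.

```typescript
// Minimal sketch of the Wait -> Poll -> Choice loop in CDK (TypeScript).
// "checkMigrationStatusFn" is a hypothetical Lambda that returns
// { status: "RUNNING" | "COMPLETED" | "FAILED" }.
import { Duration } from "aws-cdk-lib";
import { Construct } from "constructs";
import * as lambda from "aws-cdk-lib/aws-lambda";
import * as sfn from "aws-cdk-lib/aws-stepfunctions";
import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";

declare const scope: Construct;                          // your stack or construct
declare const checkMigrationStatusFn: lambda.IFunction;  // hypothetical status-check Lambda

// Wait: sleep between polls
const wait = new sfn.Wait(scope, "Wait30Seconds", {
  time: sfn.WaitTime.duration(Duration.seconds(30)),
});

// Poll: ask a Lambda function for the current status of the migration task
const poll = new tasks.LambdaInvoke(scope, "CheckMigrationStatus", {
  lambdaFunction: checkMigrationStatusFn,
  outputPath: "$.Payload",
});

// Choice: finish, fail, or go around the loop again
const decide = new sfn.Choice(scope, "IsMigrationDone")
  .when(sfn.Condition.stringEquals("$.status", "COMPLETED"), new sfn.Succeed(scope, "Done"))
  .when(sfn.Condition.stringEquals("$.status", "FAILED"), new sfn.Fail(scope, "Failed"))
  .otherwise(wait); // still running -> wait and poll again

const definition = wait.next(poll).next(decide);

new sfn.StateMachine(scope, "PollingStateMachine", {
  definitionBody: sfn.DefinitionBody.fromChainable(definition),
});
```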

Polling is inefficient and can add unnecessary cost as standard workflows are charged based on the number of state transitions.
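
As a rough illustration, assuming the us-east-1 list price of $0.025 per 1,000 state transitions: each loop iteration goes through 3 states, so polling every 30 seconds during a 4-hour migration burns roughly 480 × 3 = 1,440 transitions (about $0.036) per execution, versus a handful of transitions when the state machine simply waits for a callback. The per-execution cost is small, but it multiplies across executions and clutters the execution history.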

There is an event-driven alternative to this approach.

Here's the high-level approach (a code sketch of the two Lambda functions follows the list):

  1. To start the data migration, the state machine invokes a Lambda function and passes it a task token (the "wait for task token" integration pattern). This pauses the state machine execution until the token is returned.
  2. The Lambda function calls the Database Migration service to start the data migration.
  3. The function saves the data migration ARN (hash key) and the task token in DynamoDB, along with other relevant information (created date, etc.).
  4. The Database Migration Service publishes "StateChange" events to the default EventBridge event bus (see docs here). A Lambda function subscribes to these events and waits for a replication task to finish or fail.
  5. When triggered, the function extracts the data migration ARN from the event payload and retrieves the Step Functions task token from DynamoDB.
  6. It can use the task token to send a success or failure signal back to the state machine execution. From here, the state machine can proceed with the rest of its steps.

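Here's a hedged sketch of the two Lambda functions from the list above (TypeScript, AWS SDK v3). The table name, key attributes and the exact fields in the DMS event detail are assumptions for illustration; check them against your own setup and the DMS event documentation.

```typescript
// Hedged sketch of the two functions from the list above (AWS SDK v3).
// Table name, key names and the DMS event detail fields are assumptions.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand, GetCommand } from "@aws-sdk/lib-dynamodb";
import {
  DatabaseMigrationServiceClient,
  StartReplicationTaskCommand,
} from "@aws-sdk/client-database-migration-service";
import { SFNClient, SendTaskSuccessCommand, SendTaskFailureCommand } from "@aws-sdk/client-sfn";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const dms = new DatabaseMigrationServiceClient({});
const sfn = new SFNClient({});
const TABLE_NAME = process.env.TABLE_NAME!; // table keyed by ReplicationTaskArn (hash key)

// Steps 1-3: called by the state machine with a task token; starts the migration
// and stores the token so the second function can find it later.
export const startMigration = async (event: { taskToken: string; replicationTaskArn: string }) => {
  await dms.send(new StartReplicationTaskCommand({
    ReplicationTaskArn: event.replicationTaskArn,
    StartReplicationTaskType: "start-replication",
  }));

  await ddb.send(new PutCommand({
    TableName: TABLE_NAME,
    Item: {
      ReplicationTaskArn: event.replicationTaskArn,
      TaskToken: event.taskToken,
      CreatedAt: new Date().toISOString(),
    },
  }));
};

// Steps 4-6: triggered by the DMS state-change event on the default EventBridge bus.
export const onStateChange = async (event: { resources?: string[]; detail?: Record<string, string> }) => {
  const arn = event.resources?.[0]; // the replication task ARN
  if (!arn) return;

  const { Item } = await ddb.send(new GetCommand({
    TableName: TABLE_NAME,
    Key: { ReplicationTaskArn: arn },
  }));
  if (!Item) return;

  // The exact field and values in the event detail are assumptions -- inspect a real
  // DMS "Replication Task State Change" event in your account before relying on them.
  const status = (event.detail?.eventType ?? "").toLowerCase();

  if (status.includes("fail")) {
    await sfn.send(new SendTaskFailureCommand({
      taskToken: Item.TaskToken,
      error: "ReplicationTaskFailed",
      cause: JSON.stringify(event.detail),
    }));
  } else if (status.includes("stop") || status.includes("complete")) {
    await sfn.send(new SendTaskSuccessCommand({
      taskToken: Item.TaskToken,
      output: JSON.stringify({ replicationTaskArn: arn }),
    }));
  }
};
```
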
This approach is more efficient but also more complex. There are more moving parts, but each of them is simple to implement.

But what if you're calling a 3rd party API that doesn't support events?

You can adapt this approach to work with any service that accepts a callback URL. When the 3rd party service makes the callback, your API handler looks up the task token and sends the callback to Step Functions. Everything else stays the same as before.
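
Here's a hedged sketch of what that API handler could look like (API Gateway + Lambda). The route shape, correlation ID and table layout are assumptions for illustration.

```typescript
// Hedged sketch of the callback-URL variant's API handler.
// Assumes the callback URL included a correlation ID (e.g. /callbacks/{id}) that was
// stored in DynamoDB alongside the task token when the job was started.
import { APIGatewayProxyHandler } from "aws-lambda";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";
import { SFNClient, SendTaskSuccessCommand } from "@aws-sdk/client-sfn";

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const sfn = new SFNClient({});

export const handler: APIGatewayProxyHandler = async (event) => {
  const correlationId = event.pathParameters?.id;

  const { Item } = await ddb.send(new GetCommand({
    TableName: process.env.TABLE_NAME!,
    Key: { CorrelationId: correlationId },
  }));
  if (!Item) return { statusCode: 404, body: "unknown callback" };

  // Hand the 3rd party's result back to the paused state machine execution
  await sfn.send(new SendTaskSuccessCommand({
    taskToken: Item.TaskToken,
    output: event.body ?? "{}",
  }));

  return { statusCode: 200, body: "ok" };
};
```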

What's more, both the polling and the event-driven approaches can be implemented with the new Lambda Durable Functions too!

The waitForCondition operation is perfect for implementing the polling loop in just a few lines, like this:
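
(The sketch below is indicative only: the durable() wrapper and context shape are declared as placeholder assumptions; only the waitForCondition operation itself comes from the Durable Functions docs, so check them for the exact API.)

```typescript
// Indicative sketch only -- durable() and DurableContext are placeholders; only the
// waitForCondition operation is taken from the feature's docs.
import {
  DatabaseMigrationServiceClient,
  DescribeReplicationTasksCommand,
} from "@aws-sdk/client-database-migration-service";

interface DurableContext {
  // Re-evaluates the check on a schedule and suspends the execution in between.
  waitForCondition(check: () => Promise<boolean>, opts?: { interval?: string }): Promise<void>;
}
declare function durable<TEvent>(
  fn: (event: TEvent, ctx: DurableContext) => Promise<void>
): (event: TEvent) => Promise<void>;

const dms = new DatabaseMigrationServiceClient({});

export const handler = durable(async (event: { replicationTaskArn: string }, ctx) => {
  // Poll DMS until the replication task reaches a terminal state; the execution is
  // checkpointed and suspended between checks, so nothing spins while we wait.
  await ctx.waitForCondition(async () => {
    const { ReplicationTasks } = await dms.send(new DescribeReplicationTasksCommand({
      Filters: [{ Name: "replication-task-arn", Values: [event.replicationTaskArn] }],
    }));
    const status = ReplicationTasks?.[0]?.Status;
    return status === "stopped" || status === "failed";
  }, { interval: "30 seconds" }); // option name is an assumption

  // ...carry on with the rest of the workflow here
});
```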

Similarly, the waitForCallback operation makes implementing the event-driven approach trivial. Instead of a task token, we have to store a callback ID. As before, the callback can be triggered by an event or by a 3rd party service via a callback URL.
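
Again, a hedged sketch only: the durable() wrapper and the waitForCallback signature below are placeholder assumptions; only the operation name and the idea of storing a callback ID come from the docs.

```typescript
// Placeholder declarations -- only waitForCallback and the callback ID concept are from
// the docs; the shapes below are assumptions for illustration.
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

interface DurableCallbackContext {
  waitForCallback<T>(register: (callbackId: string) => Promise<void>): Promise<T>;
}
declare function durable<TEvent>(
  fn: (event: TEvent, ctx: DurableCallbackContext) => Promise<void>
): (event: TEvent) => Promise<void>;

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));

export const handler = durable(async (event: { replicationTaskArn: string }, ctx) => {
  // Store the callback ID (instead of a task token) so the event handler or the
  // 3rd party callback endpoint can complete it later.
  const result = await ctx.waitForCallback<Record<string, unknown>>(async (callbackId) => {
    await ddb.send(new PutCommand({
      TableName: process.env.TABLE_NAME!,
      Item: { ReplicationTaskArn: event.replicationTaskArn, CallbackId: callbackId },
    }));
  });

  console.log("migration finished", result); // ...continue with the rest of the workflow
});
```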

Polling is the default because it’s easy, not because it’s good.

If you can get an event (or a callback), you can stop spinning your state machine and start treating "waiting" as a first-class step. Less noise, fewer transitions, lower cost.

Thank you to Patrick for bringing this up in our last Q&A session. If you want to level up your AWS game, check out my Production-Ready Serverless workshop. The next cohort starts on April 13th, and the early bird tickets (30% off) are available until March 16th.

Until then, ciao!
