Event versioning strategies for event-driven architectures


Synchronous API integrations create temporal coupling [1] between two services: both must be available at the same time for the integration to work. This is a tighter form of coupling and often necessitates techniques such as retries, exponential backoff and fallbacks to compensate.

Event-driven architectures, on the other hand, encourage loose coupling. But we are still bound by lesser forms of coupling, such as schema coupling. And herein lies a question that many students and clients have asked me:

“How do I version my event schemas?”

In this post, let’s run through some common approaches, why they all suck to some degree, and the two most important decisions you need to make.

1. Add version in the event name

For example, instead of “user.created”, an event will be called “user.created.v1”.

Pros:

  • Easy to route different versions to different consumers.
  • Consumers opt in to a version explicitly.
  • Older consumers are unaffected by new versions.

Cons:

  • Backward compatibility requires duplicated events – i.e. the publisher must publish multiple versions of the same event to ensure all consumers receive the event.

When to use:

  • When there are breaking changes – e.g. field name/type changes.
  • When consumers evolve at different speeds.
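To make this concrete, here is a minimal sketch of versioned event names using a toy in-process pub/sub bus. The bus, the handler wiring and the payload shapes are all illustrative, not a real message broker:

```python
from collections import defaultdict

class EventBus:
    """A toy in-process pub/sub bus to illustrate version-in-name routing."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_name, handler):
        # Consumers opt in to a specific version explicitly, by name.
        self.handlers[event_name].append(handler)

    def publish(self, event_name, payload):
        # Only consumers subscribed to this exact name receive the event.
        for handler in self.handlers[event_name]:
            handler(payload)

bus = EventBus()
received = []
bus.subscribe("user.created.v1", lambda e: received.append(("v1", e)))
bus.subscribe("user.created.v2", lambda e: received.append(("v2", e)))

# The con in action: to keep v1 consumers working, the publisher must
# publish both versions of the same event.
bus.publish("user.created.v1", {"name": "Jane Doe"})
bus.publish("user.created.v2", {"first_name": "Jane", "last_name": "Doe"})
```

Note how the v1 consumer is entirely unaffected by the v2 event, at the cost of the publisher sending the event twice.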

2. Add version in the event payload/metadata

For example, you can include a “version” field in the event payload or metadata [2].

Pros:

  • Consumers can switch logic based on the version field.
  • Introducing non-breaking changes is easier – just mark the new schema as version 2.

Cons:

  • Backward compatibility still requires duplicated events. That is, if you don’t want to break consumers that still depend on the version 1 schema.
  • The consumer code becomes (temporarily) more complex if it needs to handle multiple versions. For example, in a coordinated update, the consumers must be updated to support both version 1 and 2 of the event schema before the publisher is updated to publish only version 2 of the event.
  • It’s harder to guarantee version safety without explicit opt-in (i.e. including the “version” field in the subscription filter). Without explicit opt-in, an unsuspecting consumer might suddenly receive events it’s not able to process (it needs v1 but receives v2). Or it might receive duplicate events (v1 and v2) if the publisher sends both to maintain backward compatibility.

When to use:

  • When consumers are tightly integrated and coordinated updates are feasible. For example, when the same team owns both consumers and publishers.
  • When centralizing topics or simplifying infrastructure is important.
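A sketch of what version-switching looks like on the consumer side. The envelope shape (a "version" field next to a "payload" field) is one possible convention, not a standard:

```python
def handle_user_created(event):
    """Switch consumer logic on the 'version' field in the event envelope."""
    version = event.get("version", 1)
    if version == 1:
        # v1 carried a single "name" field.
        name = event["payload"]["name"]
    elif version == 2:
        # v2 split the name into first/last.
        p = event["payload"]
        name = f"{p['first_name']} {p['last_name']}"
    else:
        # Without explicit opt-in, this is how an unsuspecting consumer
        # finds out about a version it can't process.
        raise ValueError(f"Unsupported event version: {version}")
    return name

v1_event = {"version": 1, "payload": {"name": "Jane Doe"}}
v2_event = {"version": 2, "payload": {"first_name": "Jane", "last_name": "Doe"}}
```

This is the "temporarily more complex" consumer code from the cons list: the if/elif branches must stay until every publisher has stopped emitting the old version.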

3. Use separate streams/topics

This is more applicable to Kafka or SNS topics. Instead of having one “user-created” topic, you’d have “user-created-v1” and “user-created-v2” topics and so on.

Pros:

  • Clear isolation between versions at the infrastructure level.
  • Consumers subscribe to exactly what they want.
  • Consumers opt in to a version explicitly.
  • Older consumers are unaffected by new versions.

Cons:

  • Can lead to topic sprawl.
  • Harder to aggregate or replay across versions.
  • Backward compatibility still requires duplicated events.

When to use:

  • When using a message broker that supports many topics – e.g. Kafka or SNS.
  • When consumers should not see multiple versions of the same event.

4. Use a schema registry and schema ID in the event

Include a schema ID or fingerprint in the event so the consumer can fetch the schema definition from a schema registry.

The consumer can then validate and deserialise the event based on the retrieved schema.

Pros:

  • Central schema management.
  • Consumers can validate and deserialize events reliably.
  • Existing tooling support from the likes of Avro/Protobuf/Thrift.

Cons:

  • Adds a dependency (temporal coupling!) on a schema registry.
  • Does not solve the versioning problem – you still need a strategy for evolving the consumer to use new versions of the events or maintain backward compatibility.
  • Backward compatibility still requires duplicated events.

When to use:

  • When you use Avro/Protobuf/Thrift and have tooling support.
  • When governance and compliance matter (e.g., banking, healthcare).
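A minimal sketch of resolving a schema by fingerprint. A real setup would use a registry service and Avro/Protobuf; here the registry is a plain dict, schemas are dicts, and "validation" just checks required fields – all illustrative:

```python
import hashlib
import json

registry = {}

def fingerprint(schema):
    # A stable hash of the schema definition serves as the schema ID.
    canonical = json.dumps(schema, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:16]

def register(schema):
    fp = fingerprint(schema)
    registry[fp] = schema
    return fp

def validate(event):
    # The consumer fetches the schema definition using the ID in the event.
    schema = registry[event["schema_id"]]
    missing = [f for f in schema["required"] if f not in event["payload"]]
    if missing:
        raise ValueError(f"Missing required fields: {missing}")
    return True

user_created_v1 = {"name": "user.created", "required": ["name"]}
schema_id = register(user_created_v1)
event = {"schema_id": schema_id, "payload": {"name": "Jane Doe"}}
```

The lookup in `validate` is where the temporal coupling sneaks back in: if the registry is unavailable, the consumer can't process the event.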

———————————–

So far, all the approaches involve adding a version number somewhere.

They all suffer from an inability to support backward compatibility gracefully.

The overhead of supporting backward compatibility lies squarely with the event publishers. Unless you abandon backward compatibility, the publishers must publish multiple versions of the same event. This creates tricky failure cases, e.g.

  1. The publisher successfully sends the user.created.v1 event.
  2. The publisher fails to send the user.created.v2 event.

It’s not possible to roll back the user.created.v1 event. The only thing the publisher can do is retry sending the user.created.v2 event.

But what if the error persists?

The publisher can’t retry indefinitely, especially if it’s an API handler and needs to respond to user queries quickly.

Maybe you can offload the event to a queue so it can be retried asynchronously and/or alert a human operator to investigate.
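The dual-publish failure mode above can be sketched like this. The function names, the retry budget and the queue are all illustrative; a real system would use a broker client and something like an SQS queue for the offload:

```python
def publish_both(publish, retry_queue, user, max_attempts=3):
    """Dual-publish v1 and v2 of user.created; offload v2 on persistent failure."""
    # Once sent, the v1 event cannot be rolled back.
    publish("user.created.v1", {"name": user["name"]})

    v2_payload = {"first_name": user["first"], "last_name": user["last"]}
    for _ in range(max_attempts):
        try:
            publish("user.created.v2", v2_payload)
            return "published"
        except ConnectionError:
            continue  # retry a bounded number of times
    # Persistent failure: hand off for asynchronous retry and/or alerting.
    retry_queue.append(("user.created.v2", v2_payload))
    return "offloaded"

# Simulate a broker that accepts v1 but persistently rejects v2.
sent, queue = [], []
def flaky_publish(name, payload):
    if name.endswith(".v2"):
        raise ConnectionError("broker unavailable")
    sent.append(name)

result = publish_both(flaky_publish, queue,
                      {"name": "Jane Doe", "first": "Jane", "last": "Doe"})
```

After this runs, v1 is out in the world, v2 sits in a retry queue, and consumers of the two versions are temporarily out of sync – the exact problem described above.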

Again, we have replaced one problem (maintaining backward compatibility) with another equally troublesome problem.

I think, fundamentally, it comes down to two choices:

  • Can we do away with versioning altogether?
  • Can we find a way to support backward compatibility that doesn’t overcomplicate things for the publishers?

———————————–

5. No breaking changes!

Always add new fields, never remove/rename existing fields, and never change the data type of existing fields.

This is the approach that PostNL took. They also implemented a custom message broker to provide schema registration and validation. Listen to my conversation with Luc van Donkersgoed [3] (principal engineer at PostNL) to learn more.

Pros:

  • No need for version numbers.
  • Consumers keep working if they only rely on known fields.

Cons:

  • Requires discipline and tooling to avoid accidentally introducing breaking changes.
  • Doesn’t work well if you need to remove or rename fields. Instead of duplicating events, the publishers must duplicate fields whenever they need to introduce a breaking change. It’s not as disruptive as duplicating events, but it’s still a complexity that needs to be carried in the publisher code, potentially forever.

When to use:

  • When you can enforce strict schema validations (e.g., with schema registries and/or event brokers).
  • When you want to avoid versioning altogether.
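The "discipline and tooling" part can be automated. Here is a sketch of an additive-only compatibility check, with schemas represented as flat `{field: type}` dicts (an illustrative format – a real check would walk Avro/JSON Schema definitions):

```python
def is_backward_compatible(old_schema, new_schema):
    """Allow new fields; forbid removing, renaming or retyping existing ones."""
    for field, field_type in old_schema.items():
        if field not in new_schema:
            return False  # removed or renamed field: breaking change
        if new_schema[field] != field_type:
            return False  # changed field type: breaking change
    return True  # purely additive changes are fine

v1 = {"name": "string", "email": "string"}
v2_additive = {"name": "string", "email": "string", "opt_in": "boolean"}
v2_renamed = {"first_name": "string", "last_name": "string", "email": "string"}
```

Run a check like this in CI (or in the broker, as PostNL did) and accidental breaking changes get caught before they ship.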

6. Out-of-band translation from new version to old version

Instead of making the publishers responsible for providing old versions of the events for backward compatibility, you can create consumers that are responsible for translating event version N+1 to version N.

Whenever you need to introduce a breaking change and create event version N+1, you also create an event consumer whose only job is to convert this new version to the previous version.

This translation layer can be implemented and managed by individual publishers or centrally managed by a “translation service”.

Pros:

  • Consumers always see the version they expect.
  • Moves compatibility logic out of the business logic.
  • Schema evolution can be centrally managed.

Cons:

  • It adds infrastructure overhead.
  • It adds additional event processing costs.
  • It creates more moving parts and more potential points of failure.

When to use:

  • When you have many consumers and want to shield them from schema changes.
  • When schema evolution needs to be managed centrally, e.g. by a platform team.
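A sketch of such a translation consumer. It subscribes only to v2 events and republishes the v1 equivalent; event shapes and names are illustrative:

```python
def translate_v2_to_v1(event):
    """Downgrade a user.created.v2 event to the v1 shape."""
    p = event["payload"]
    return {
        "type": "user.created.v1",
        "payload": {"name": f"{p['first_name']} {p['last_name']}"},
    }

def translation_consumer(event, publish):
    # This consumer's only job: convert version N+1 back to version N
    # so that older consumers keep receiving the shape they expect.
    if event["type"] == "user.created.v2":
        publish(translate_v2_to_v1(event))

republished = []
translation_consumer(
    {"type": "user.created.v2",
     "payload": {"first_name": "Jane", "last_name": "Doe"}},
    republished.append,
)
```

The publisher now emits only v2; the translation layer (owned by the publishing team or a central platform team) carries the backward-compatibility burden.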

Summary

Those are six approaches to versioning event schemas.

I tend to avoid approaches 1 to 4 because they don’t address the fundamental problem with versioning – how to deal with backward compatibility.

I prefer approach no. 5 – to ensure backward compatibility by forbidding breaking changes. It has the shortest distance to the desired outcome, which is to safely evolve event schemas without breaking existing consumers.

If you want to learn more about building event-driven architectures for the real world, check out my upcoming Production-Ready Serverless boot camp [4]. We cover various topics around event-driven architectures, including design principles, DDD, testing strategy, observability and error handling best practices.

Links

[1] The many facets of coupling

[2] EventBridge best practice: why you should wrap events in event envelopes

[3] Event-driven architecture at PostNL with Luc van Donkersgoed

[4] Production-Ready Serverless boot camp
