Synchronous API integrations create temporal coupling [1] between two services: both must be available at the same time for the integration to work. This is a tighter form of coupling and often necessitates compensating techniques such as retries, exponential backoff and fallbacks. Event-driven architectures, on the other hand, encourage loose coupling. But we are still bound by lesser forms of coupling, such as schema coupling. And here lies a question that many students and clients have asked me: “How do I version my event schemas?” In this post, let’s run through some common approaches, why they all suck to some degree, and the two most important decisions you need to make.

1. Add version in the event name
For example, instead of “user.created”, an event will be called “user.created.v1” (see the sketch below).
Pros:
Cons:
When to use:
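To make this concrete, here is a minimal sketch of approach 1, assuming EventBridge as the event bus; the bus name, source and event shape are illustrative:

```ts
import { EventBridgeClient, PutEventsCommand } from "@aws-sdk/client-eventbridge";

const eventBridge = new EventBridgeClient({});

// The version lives in the event name (the detail-type), e.g. "user.created.v1".
// Consumers subscribe to the specific version they understand, e.g. with a rule
// whose pattern is: { "detail-type": ["user.created.v1"] }
export async function publishUserCreatedV1(user: { id: string; email: string }) {
  await eventBridge.send(new PutEventsCommand({
    Entries: [{
      EventBusName: "my-event-bus",      // illustrative bus name
      Source: "user-service",            // illustrative source
      DetailType: "user.created.v1",     // version encoded in the event name
      Detail: JSON.stringify(user),
    }],
  }));
}
```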
2. Add version in the event payload/metadata
For example, you can include a “version” field in the event payload or metadata [2] (see the sketch below).
Pros:
Cons:
When to use:
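As an illustration, here is a minimal sketch of an event envelope carrying the version in its metadata, with the consumer branching on it. The envelope shape and field names are assumptions, loosely following the event envelope idea from [2]:

```ts
// Hypothetical envelope shape: the metadata carries the schema version.
type EventEnvelope<T> = {
  metadata: { type: string; version: string; publishedAt: string };
  payload: T;
};

type UserCreatedV1 = { id: string; email: string };
type UserCreatedV2 = { id: string; email: string; locale: string };

// The consumer inspects metadata.version to decide how to deserialise.
function handleUserCreated(event: EventEnvelope<unknown>) {
  switch (event.metadata.version) {
    case "1":
      return processV1(event.payload as UserCreatedV1);
    case "2":
      return processV2(event.payload as UserCreatedV2);
    default:
      throw new Error(`Unsupported schema version: ${event.metadata.version}`);
  }
}

function processV1(user: UserCreatedV1) { /* version-specific logic */ }
function processV2(user: UserCreatedV2) { /* version-specific logic */ }
```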
3. Use separate streams/topics
This is more applicable to Kafka or SNS topics. Instead of having one “user-created” topic, you’d have “user-created-v1” and “user-created-v2” topics and so on (see the sketch below).
Pros:
Cons:
When to use:
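Here is a minimal sketch of approach 3 using kafkajs; the broker address and topic names are illustrative, and the double-publish during a migration window is one possible strategy, not the only one:

```ts
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "user-service", brokers: ["localhost:9092"] });
const producer = kafka.producer();

// One topic per schema version; consumers subscribe to the version they support.
export async function publishUserCreated(user: { id: string; email: string; locale: string }) {
  await producer.connect();

  // Publish the new shape to the v2 topic.
  await producer.send({
    topic: "user-created-v2",
    messages: [{ key: user.id, value: JSON.stringify(user) }],
  });

  // During a migration window, also publish the old shape for legacy consumers
  // (here, v1 simply lacks the "locale" field).
  const { locale, ...v1User } = user;
  await producer.send({
    topic: "user-created-v1",
    messages: [{ key: user.id, value: JSON.stringify(v1User) }],
  });
}
```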
4. Use a schema registry and schema ID in the event
Include a schema ID or fingerprint in the event so the consumer can fetch the schema definition from a schema registry. The consumer can then validate and deserialise the event based on the retrieved schema (see the sketch below).
Pros:
Cons:
When to use:
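Here is a minimal sketch of approach 4 from the consumer side. fetchSchema is a hypothetical registry client (real registries such as AWS Glue Schema Registry or Confluent Schema Registry have their own SDKs), and the validation uses JSON Schema via ajv:

```ts
import Ajv, { Schema } from "ajv";

const ajv = new Ajv();

// Hypothetical registry client: fetches a JSON Schema by its ID/fingerprint.
declare function fetchSchema(schemaId: string): Promise<Schema>;

type VersionedEvent = { schemaId: string; payload: unknown };

// The consumer resolves the schema from the registry, then validates the
// payload against it before processing.
export async function handleEvent(event: VersionedEvent) {
  const schema = await fetchSchema(event.schemaId);
  const validate = ajv.compile(schema);
  if (!validate(event.payload)) {
    throw new Error(`Event failed schema validation: ${ajv.errorsText(validate.errors)}`);
  }
  // Safe to process event.payload according to the retrieved schema.
}
```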
———————————–

So far, all the approaches involve adding a version number somewhere. They all suffer from an inability to support backward compatibility gracefully. The overhead of supporting backward compatibility lies squarely with the event publishers. Unless you abandon backward compatibility, the publishers must publish multiple versions of the same event. This creates tricky failure cases, e.g. the publisher successfully publishes the user.created.v1 event but then fails to publish the user.created.v2 event.
It’s not possible to roll back the user.created.v1 event. The only thing the publisher can do is retry sending the user.created.v2 event. But what if the error persists? The publisher can’t retry indefinitely, especially if it’s an API handler and needs to respond to user queries quickly. Maybe you can offload the event to a queue so it can be retried asynchronously and/or alert a human operator to investigate (see the sketch below). Again, we have replaced one problem (maintaining backward compatibility) with another equally troublesome problem. I think, fundamentally, it comes down to two choices: forbid breaking changes altogether, so there is nothing to version; or allow breaking changes, but take the burden of backward compatibility off the publishers. The last two approaches embody these two choices.
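For illustration, here is a minimal sketch of that dual-publish failure mode and the queue-based mitigation described above, assuming EventBridge for publishing and SQS as the retry queue; all names are illustrative:

```ts
import { EventBridgeClient, PutEventsCommand } from "@aws-sdk/client-eventbridge";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const eventBridge = new EventBridgeClient({});
const sqs = new SQSClient({});

export async function publishUserCreated(user: { id: string; email: string }) {
  // Publish the old version first; once this succeeds there is no rolling it back.
  await publish("user.created.v1", user);

  try {
    await publish("user.created.v2", user);
  } catch (err) {
    // v1 is already out there, so v1 and v2 consumers are now inconsistent.
    // Offload the failed event to a queue so it can be retried asynchronously.
    console.error("failed to publish user.created.v2, offloading to retry queue", err);
    await sqs.send(new SendMessageCommand({
      QueueUrl: process.env.RETRY_QUEUE_URL!, // illustrative queue
      MessageBody: JSON.stringify({ detailType: "user.created.v2", detail: user }),
    }));
  }
}

async function publish(detailType: string, detail: unknown) {
  await eventBridge.send(new PutEventsCommand({
    Entries: [{
      EventBusName: "my-event-bus",
      Source: "user-service",
      DetailType: detailType,
      Detail: JSON.stringify(detail),
    }],
  }));
}
```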
———————————–

5. No breaking changes!
Always add new fields, never remove/rename existing fields, and never change the data type of existing fields (see the sketch below). This is the approach that PostNL took. They also implemented a custom message broker to provide schema registration and validation. Listen to my conversation with Luc van Donkersgoed [3] (principal engineer at PostNL) to learn more.
Pros:
Cons:
When to use:
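As an illustration, here is what additive-only evolution looks like expressed as TypeScript types; the field names are made up:

```ts
// Original schema.
type UserCreated = {
  id: string;
  email: string;
};

// Evolved schema: additive only, and the new field is optional so that
// events published before the change still conform.
type UserCreatedEvolved = {
  id: string;        // unchanged: never remove or rename existing fields
  email: string;     // unchanged: never change the data type
  locale?: string;   // new optional field: safe to add
  // DON'T: rename "email" to "emailAddress", change "id" to a number,
  // or remove a field that existing consumers rely on.
};
```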
6. Out-of-band translation from new version to old version
Instead of the publishers being responsible for providing old versions of the events for backward compatibility, you can create consumers who are responsible for translating event version N+1 to version N. Whenever you need to introduce a breaking change and create event version N+1, you also create an event consumer whose only job is to convert this new version to the previous version (see the sketch below). This translation layer can be implemented and managed by individual publishers or centrally managed by a “translation service”.
Pros:
Cons:
When to use:
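Here is a minimal sketch of such a translator, assuming an EventBridge rule feeds user.created.v2 events to a Lambda function that republishes them as user.created.v1; the downgrade logic and all names are illustrative:

```ts
import { EventBridgeClient, PutEventsCommand } from "@aws-sdk/client-eventbridge";
import type { EventBridgeEvent } from "aws-lambda";

const eventBridge = new EventBridgeClient({});

type UserCreatedV2 = { id: string; email: string; locale: string };

// Triggered by a rule matching detail-type "user.created.v2".
// Its only job: downgrade v2 events to v1 for legacy consumers.
export async function handler(event: EventBridgeEvent<"user.created.v2", UserCreatedV2>) {
  const { locale, ...v1Detail } = event.detail; // drop the field v1 doesn't have

  await eventBridge.send(new PutEventsCommand({
    Entries: [{
      EventBusName: "my-event-bus",
      Source: "translation-service",
      DetailType: "user.created.v1",
      Detail: JSON.stringify(v1Detail),
    }],
  }));
}
```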
Summary
Those are six approaches to versioning event schemas. I tend to avoid approaches 1 to 4 because they don’t address the fundamental problem with versioning – how to deal with backward compatibility. I prefer approach no. 5 – ensuring backward compatibility by forbidding breaking changes. It has the shortest distance to the desired outcome, which is to safely evolve event schemas without breaking existing consumers. If you want to learn more about building event-driven architectures for the real world, check out my upcoming Production-Ready Serverless boot camp [4]. We cover various topics around event-driven architectures, including design principles, DDD, testing strategy, observability and error handling best practices.

Links
[1] The many facets of coupling
[2] EventBridge best practice: why you should wrap events in event envelopes
[3] Event-driven architecture at PostNL with Luc van Donkersgoed
[4] Production-Ready Serverless boot camp