Synchronous API integrations create temporal coupling [1] between two services based on their respective availability. This is a tighter form of coupling and often necessitates compensating techniques such as retries, exponential backoff and fallbacks. Event-driven architectures, on the other hand, encourage loose coupling. But we are still bound by lesser forms of coupling, such as schema coupling. And here lies a question that many students and clients have asked me: “How do I version my event schemas?” In this post, let’s run through some common approaches, why they all suck to some degree, and the two most important decisions you need to make.

1. Add version in the event name

For example, instead of “user.created”, an event would be called “user.created.v1”.

Pros:
Cons:
When to use:
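A minimal sketch of name-based versioning (the event names, field names and handler code are all illustrative, not a real library): consumers register against a specific versioned name, so introducing a breaking change means publishing under a new name.

```python
# HANDLERS maps a fully-qualified, versioned event name to its handler.
HANDLERS = {}

def on(event_name):
    """Register a handler for one specific event name (and version)."""
    def register(fn):
        HANDLERS[event_name] = fn
        return fn
    return register

@on("user.created.v1")
def handle_user_created_v1(payload):
    # v1 carries a single "name" field
    return f"v1: {payload['name']}"

@on("user.created.v2")
def handle_user_created_v2(payload):
    # v2 split "name" into first_name/last_name, a breaking change,
    # hence the new event name
    return f"v2: {payload['first_name']} {payload['last_name']}"

def dispatch(event):
    # Routing is purely by name, so v1 consumers never see v2 events
    return HANDLERS[event["name"]](event["payload"])
```

Note that nothing forces the publisher to keep emitting v1 once v2 exists; that burden is the crux of the problem discussed later.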
2. Add version in the event payload/metadata

For example, you can include a “version” field in the event payload or metadata [2].

Pros:
Cons:
When to use:
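A minimal sketch of this approach, assuming an envelope whose metadata carries a version field (the envelope shape and field names are illustrative): a single consumer branches on the version at runtime.

```python
def make_event(event_type, version, data):
    """Wrap the payload in an envelope; the metadata carries the schema version."""
    return {"type": event_type, "metadata": {"version": version}, "data": data}

def handle_user_created(event):
    """One consumer handles every version by branching on the metadata."""
    version = event["metadata"]["version"]
    data = event["data"]
    if version == 1:
        return data["name"]
    if version == 2:
        # v2 split "name" into first_name/last_name
        return f"{data['first_name']} {data['last_name']}"
    raise ValueError(f"unsupported schema version: {version}")
```

The version-dispatch logic accumulates inside every consumer, which is one reason this approach gets painful as versions pile up.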
3. Use separate streams/topics

This is more applicable to Kafka or SNS topics. Instead of having one “user-created” topic, you’d have “user-created-v1” and “user-created-v2” topics, and so on.

Pros:
Cons:
When to use:
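A minimal sketch of version-per-topic routing, with an in-memory dictionary standing in for the broker (the topic names are illustrative):

```python
# In-memory stand-in for a broker: one topic per schema version.
topics = {"user-created-v1": [], "user-created-v2": []}

def publish(topic, event):
    topics[topic].append(event)

def consume(topic):
    """A consumer subscribes to exactly one versioned topic, so it is
    never surprised by a schema it doesn't understand."""
    return list(topics[topic])
```

The isolation is the selling point: a v1 consumer literally cannot receive a v2 event. The cost is that the publisher must keep feeding every topic that still has subscribers.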
4. Use a schema registry and schema ID in the event

Include a schema ID or fingerprint in the event so the consumer can fetch the schema definition from a schema registry. The consumer can then validate and deserialise the event based on the retrieved schema.

Pros:
Cons:
When to use:
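A minimal sketch of the idea, with a plain dictionary standing in for a real schema registry and made-up schema IDs (in practice you would use a managed registry and a serialisation format such as Avro, not hand-rolled validation):

```python
# Illustrative registry: schema ID -> expected fields and their types.
SCHEMA_REGISTRY = {
    "user.created/1": {"user_id": int, "name": str},
    "user.created/2": {"user_id": int, "first_name": str, "last_name": str},
}

def deserialise(event):
    """Fetch the schema by the ID embedded in the event, then validate
    the payload against it before handing it to business logic."""
    schema = SCHEMA_REGISTRY[event["schema_id"]]
    payload = event["payload"]
    for field, field_type in schema.items():
        if not isinstance(payload.get(field), field_type):
            raise ValueError(f"{field!r} violates schema {event['schema_id']}")
    return payload
```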
———————————–

So far, all the approaches involve adding a version number somewhere, and they all suffer from an inability to support backward compatibility gracefully. The overhead of supporting backward compatibility lies squarely with the event publishers: unless you abandon backward compatibility, the publishers must publish multiple versions of the same event. This creates tricky failure cases, e.g.
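the v1 publish succeeds but the v2 publish fails, and there is no atomic way to publish both. A minimal sketch of this dual-publish failure mode, offloading the failed event to a queue for asynchronous retries (FlakyBus and the event names are illustrative stand-ins, not a real broker client):

```python
from collections import deque

class FlakyBus:
    """Test double for a broker that happens to reject v2 events."""
    def __init__(self):
        self.published = []

    def publish(self, event):
        if event["name"].endswith(".v2"):
            raise RuntimeError("broker rejected event")
        self.published.append(event)

retry_queue = deque()

def publish_both_versions(bus, v1_event, v2_event):
    bus.publish(v1_event)          # succeeds; cannot be rolled back
    try:
        bus.publish(v2_event)      # fails independently of v1
    except RuntimeError:
        # Offload for asynchronous retries (and/or alert a human operator)
        retry_queue.append(v2_event)
```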
It’s not possible to roll back the user.created.v1 event. The only thing the publisher can do is retry sending the user.created.v2 event. But what if the error persists? The publisher can’t retry indefinitely, especially if it’s an API handler that needs to respond to user requests quickly. Maybe you can offload the event to a queue so it can be retried asynchronously, and/or alert a human operator to investigate. Again, we have replaced one problem (maintaining backward compatibility) with another, equally troublesome problem. I think, fundamentally, it comes down to two choices: avoid breaking changes altogether, or move the responsibility for backward compatibility off the publishers. The final two approaches represent these choices.
———————————–

5. No breaking changes!

Always add new fields; never remove or rename existing fields, and never change the data type of an existing field. This is the approach that PostNL took. They also implemented a custom message broker to provide schema registration and validation. Listen to my conversation with Luc van Donkersgoed [3] (principal engineer at PostNL) to learn more.

Pros:
Cons:
When to use:
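One way to enforce the no-breaking-changes rule is to check every new schema revision against the previous one before accepting it. Here is a minimal sketch of such an additive-only check; the schema representation is made up for illustration, and this is not PostNL’s actual implementation (theirs lives inside their custom broker).

```python
def is_backward_compatible(old_schema, new_schema):
    """Additive-only rule: every existing field must survive unchanged.
    New fields are allowed; removals, renames and type changes are not."""
    return all(
        new_schema.get(field) == field_type
        for field, field_type in old_schema.items()
    )

v1 = {"user_id": "int", "name": "string"}
v2_ok = {**v1, "email": "string"}                   # added a field: allowed
v2_bad = {"user_id": "int", "full_name": "string"}  # renamed a field: breaking
```

Run as a check in CI or at schema-registration time, this turns the convention into a guarantee.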
6. Out-of-band translation from new version to old version

Instead of making the publishers responsible for providing old versions of the events for backward compatibility, you can create consumers that are responsible for translating event version N+1 to version N. Whenever you need to introduce a breaking change and create event version N+1, you also create an event consumer whose only job is to convert the new version to the previous version. This translation layer can be implemented and managed by individual publishers, or centrally managed by a “translation service”.

Pros:
Cons:
When to use:
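A minimal sketch of such a translation consumer, assuming a v2 user.created event that split v1’s single name field into first_name/last_name (all event and field names are illustrative):

```python
def translate_v2_to_v1(v2_event):
    """Downgrade a v2 user.created event to the v1 shape."""
    data = v2_event["data"]
    return {
        "name": "user.created.v1",
        "data": {
            "user_id": data["user_id"],
            # v1 had a single "name" field; recombine it from v2's fields
            "name": f"{data['first_name']} {data['last_name']}",
        },
    }

def translation_consumer(v2_event, publish):
    """This consumer's only job: turn version N+1 back into version N
    and republish it for the consumers that haven't migrated yet."""
    publish(translate_v2_to_v1(v2_event))
```

The publisher only ever emits the newest version; old consumers keep working because the translated events land on the old version’s name/topic.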
Summary

Those are six approaches to versioning event schemas. I tend to avoid approaches 1 to 4 because they don’t address the fundamental problem with versioning – how to deal with backward compatibility. I prefer approach no. 5 – ensuring backward compatibility by forbidding breaking changes. It has the shortest distance to the desired outcome, which is to safely evolve event schemas without breaking existing consumers.

If you want to learn more about building event-driven architectures for the real world, check out my upcoming Production-Ready Serverless boot camp [4]. We cover various topics around event-driven architectures, including design principles, DDD, testing strategy, observability and error handling best practices.

Links

[1] The many facets of coupling
[2] EventBridge best practice: why you should wrap events in event envelopes
[3] Event-driven architecture at PostNL with Luc van Donkersgoed