Synchronous API integrations create temporal coupling [1] between two services: both must be available at the same time. This is a tighter form of coupling and often necessitates compensating techniques such as retries, exponential backoff and fallbacks. Event-driven architectures, on the other hand, encourage loose coupling. But we are still bound by lesser forms of coupling, such as schema coupling. And here lies a question that many students and clients have asked me: “How do I version my event schemas?” In this post, let’s run through some common approaches, why they all suck to some degree, and the two most important decisions you need to make.

1. Add version in the event name

For example, instead of “user.created”, an event will be called “user.created.v1”.

Pros:
Cons:
When to use:
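To make this concrete, here’s a minimal sketch of what version-in-the-name looks like in practice. The helper names are hypothetical, not from any particular library; the point is that “user.created.v1” and “user.created.v2” are entirely separate events as far as routing and subscriptions are concerned.

```python
# Sketch: embedding the schema version in the event name itself.
# Helper names are illustrative, not from a real framework.

def make_event_name(base: str, version: int) -> str:
    """Build a versioned event name, e.g. 'user.created' + 2 -> 'user.created.v2'."""
    return f"{base}.v{version}"

def parse_event_name(name: str) -> tuple[str, int]:
    """Split a versioned event name back into (base, version)."""
    base, _, suffix = name.rpartition(".")
    if not (suffix.startswith("v") and suffix[1:].isdigit()):
        raise ValueError(f"no version suffix in event name: {name}")
    return base, int(suffix[1:])

event_name = make_event_name("user.created", 2)   # "user.created.v2"
base, version = parse_event_name(event_name)      # ("user.created", 2)
```

Because consumers subscribe to the exact name, bumping the version silently drops every consumer still listening on the old name unless the publisher keeps emitting both.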
2. Add version in the event payload/metadata

For example, you can include a “version” field in the event payload or metadata [2].

Pros:
Cons:
When to use:
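A minimal sketch of this approach, with an illustrative envelope shape (the field names are assumptions, not a standard): all versions share one event name, and the consumer branches on the version field it finds in the metadata.

```python
# Sketch: version carried inside the event envelope (field names are
# illustrative). The consumer must branch on the version explicitly.

def handle_user_created(event: dict) -> str:
    version = event["metadata"]["version"]
    data = event["data"]
    if version == 1:
        # hypothetical v1 shape: a single "name" field
        return data["name"]
    if version == 2:
        # hypothetical v2 shape: "name" split into first/last (a breaking change)
        return f"{data['first_name']} {data['last_name']}"
    raise ValueError(f"unsupported event version: {version}")

v1_event = {"metadata": {"version": 1},
            "data": {"name": "Ada Lovelace"}}
v2_event = {"metadata": {"version": 2},
            "data": {"first_name": "Ada", "last_name": "Lovelace"}}
```

Note how the version-handling burden has moved into every consumer’s code path, and each consumer must be updated before the publisher starts emitting the new version.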
3. Use separate streams/topics

This is more applicable to Kafka or SNS topics. Instead of having one “user-created” topic, you’d have “user-created-v1” and “user-created-v2” topics and so on.

Pros:
Cons:
When to use:
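Here’s a toy illustration of the topic-per-version idea, using an in-memory broker as a stand-in for Kafka or SNS (the Broker class is purely for demonstration). Consumers pin themselves to the version they understand, and the publisher must keep publishing to every topic that still has live consumers.

```python
# Sketch: one topic per schema version. The Broker class is an in-memory
# stand-in for Kafka/SNS, just to show the routing behaviour.

from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        for handler in self.subscribers[topic]:
            handler(message)

broker = Broker()
received = []
broker.subscribe("user-created-v1", received.append)  # a legacy consumer

# The publisher has to dual-publish for as long as v1 consumers exist.
broker.publish("user-created-v1", {"name": "Ada"})
broker.publish("user-created-v2",
               {"first_name": "Ada", "last_name": "Lovelace"})
```

Only the v1 subscriber receives anything here; the v2 event goes to a topic nobody is listening on yet, which is exactly the migration window this approach forces you to manage.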
4. Use a schema registry and schema ID in the event

Include a schema ID or fingerprint in the event so the consumer can fetch the schema definition from a schema registry. The consumer can then validate and deserialise the event based on the retrieved schema.

Pros:
Cons:
When to use:
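A toy sketch of the registry flow. Real registries (e.g. Confluent’s Schema Registry with Avro) also handle serialisation formats and compatibility rules; this simplification only shows the lookup-by-fingerprint mechanic, and the schema shape is an assumption for illustration.

```python
# Sketch: a toy in-process schema registry keyed by fingerprint.
# Real registries are separate services with richer schema languages.

import hashlib
import json

REGISTRY: dict[str, dict] = {}

def register_schema(schema: dict) -> str:
    """Store the schema and return a short fingerprint to embed in events."""
    fingerprint = hashlib.sha256(
        json.dumps(schema, sort_keys=True).encode()).hexdigest()[:12]
    REGISTRY[fingerprint] = schema
    return fingerprint

def validate(event: dict) -> dict:
    """Consumer side: fetch the schema by ID, then validate the payload."""
    schema = REGISTRY[event["schema_id"]]
    missing = [f for f in schema["required"] if f not in event["data"]]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return event["data"]

schema_id = register_schema({"name": "user.created", "required": ["email"]})
event = {"schema_id": schema_id, "data": {"email": "ada@example.com"}}
```

The registry tells the consumer what shape the event has; it does not, by itself, help a consumer that only understands the old shape cope with a breaking change.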
———————————–

So far, all the approaches involve adding a version number somewhere. They all suffer from an inability to support backward compatibility gracefully. The overhead of supporting backward compatibility lies squarely with the event publishers. Unless you abandon backward compatibility altogether, the publishers must publish multiple versions of the same event. This creates tricky failure cases. For example, suppose the publisher publishes the user.created.v1 event successfully, but publishing the user.created.v2 event fails.
It’s not possible to roll back the user.created.v1 event. The only thing the publisher can do is retry sending the user.created.v2 event. But what if the error persists? The publisher can’t retry indefinitely, especially if it’s an API handler and needs to respond to user queries quickly. Maybe you can offload the event to a queue so it can be retried asynchronously and/or alert a human operator to investigate. Again, we have replaced one problem (maintaining backward compatibility) with another equally troublesome problem. I think, fundamentally, it comes down to two choices:
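The dual-publish failure mode described above can be sketched like this. The publish function and retry queue are hypothetical stand-ins (in practice the queue might be SQS and the alert a CloudWatch alarm); the point is that v1 is already out by the time v2 fails, so offloading for asynchronous retry is about the best the publisher can do.

```python
# Sketch of the dual-publish failure case: v1 succeeds, v2 fails, and
# the failed event is offloaded to a retry queue. All names are
# illustrative stand-ins for real broker/queue APIs.

retry_queue = []

def publish(topic: str, event: dict, fail: bool = False) -> None:
    """Stand-in for a real publish call; 'fail' simulates a broker error."""
    if fail:
        raise RuntimeError(f"publish to {topic} failed")

def publish_user_created(v1_event: dict, v2_event: dict,
                         v2_fails: bool = False) -> None:
    publish("user.created.v1", v1_event)  # succeeds; cannot be rolled back
    try:
        publish("user.created.v2", v2_event, fail=v2_fails)
    except RuntimeError:
        # Can't retry inline forever (e.g. inside an API handler that must
        # respond quickly), so offload for async retry and/or alert a human.
        retry_queue.append(("user.created.v2", v2_event))

publish_user_created({"name": "Ada"},
                     {"first_name": "Ada", "last_name": "Lovelace"},
                     v2_fails=True)
```

After this runs, v1 consumers have already seen the event while the v2 copy sits in the retry queue: the two versions are temporarily inconsistent, which is the new problem we traded for.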
———————————–

5. No breaking changes!

Always add new fields, never remove or rename existing fields, and never change the data type of existing fields. This is the approach that PostNL took. They also implemented a custom message broker to provide schema registration and validation. Listen to my conversation with Luc van Donkersgoed [3] (principal engineer at PostNL) to learn more.

Pros:
Cons:
When to use:
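The “no breaking changes” rule is mechanical enough to enforce automatically. Here’s a sketch of the kind of backward-compatibility check a custom broker could run at schema registration time; representing schemas as plain field-to-type maps is a simplification of real schema languages, not how PostNL’s broker actually works.

```python
# Sketch: enforce "additive changes only" at schema registration time.
# Schemas are simplified to field-name -> type maps for illustration.

def is_backward_compatible(old: dict[str, type], new: dict[str, type]) -> bool:
    """True if every old field survives with the same type.

    Adding new fields is always allowed; removing, renaming, or
    retyping an existing field is a breaking change.
    """
    return all(field in new and new[field] is type_
               for field, type_ in old.items())

v1 = {"email": str, "name": str}
v2_ok = {"email": str, "name": str, "signup_source": str}   # additive only
v2_bad = {"email": str, "first_name": str, "last_name": str}  # renamed "name"
```

With a gate like this in the registration path, a breaking schema change is rejected before any event with the new shape can be published.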
6. Out-of-band translation from new version to old version

Instead of the publishers being responsible for providing old versions of the events for backward compatibility, you can create consumers who are responsible for translating event version N+1 to version N. Whenever you need to introduce a breaking change and create event version N+1, you also create an event consumer whose only job is to convert this new version to the previous version. This translation layer can be implemented and managed by individual publishers or centrally managed by a “translation service”.

Pros:
Cons:
When to use:
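A minimal sketch of such a translation consumer, assuming the same hypothetical v2 shape as earlier (name split into first/last) and an illustrative republish hook: it subscribes to the new version and re-emits the old shape so legacy consumers never see the breaking change.

```python
# Sketch: a consumer whose only job is to downgrade v2 events to v1
# and republish them. Event shapes and the republish hook are
# illustrative, not a real broker API.

republished = []

def downgrade_v2_to_v1(v2_event: dict) -> dict:
    # Hypothetical breaking change: v2 split "name" into first/last,
    # so recombine the fields for v1 consumers.
    return {"name": f"{v2_event['first_name']} {v2_event['last_name']}"}

def translation_consumer(v2_event: dict) -> None:
    """Subscribed to user.created.v2; re-emits the event as v1."""
    republished.append(("user.created.v1", downgrade_v2_to_v1(v2_event)))

translation_consumer({"first_name": "Ada", "last_name": "Lovelace"})
```

The publisher now emits only the latest version; the cost is an extra hop of latency for legacy consumers and another component to own and monitor.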
Summary

Those are six approaches to versioning event schemas. I tend to avoid approaches 1 to 4 because they don’t address the fundamental problem with versioning – how to deal with backward compatibility. I prefer approach no. 5 – ensuring backward compatibility by forbidding breaking changes. It has the shortest distance to the desired outcome, which is to safely evolve event schemas without breaking existing consumers. If you want to learn more about building event-driven architectures for the real world, check out my upcoming Production-Ready Serverless boot camp [4]. We cover various topics around event-driven architectures, including design principles, DDD, testing strategy, observability and error handling best practices.

Links

[1] The many facets of coupling
[2] EventBridge best practice: why you should wrap events in event envelopes
[3] Event-driven architecture at PostNL with Luc van Donkersgoed
[4] Production-Ready Serverless boot camp