How to detect and prevent breaking changes in event schemas


Last week, we looked at 6 ways to version event schemas [1] and found that the best solution is to avoid breaking changes and minimise the need for versioning.

But how exactly do you do that?

How can you prevent accidental breaking changes from creeping in?

You can detect and stop breaking changes:

  • At runtime, when the events are ingested;
  • During development, when schema changes are made;
  • Or a combination of both!

Here are three approaches you should consider.

1. Consumer-Driven Contracts

In consumer-driven contract testing, the consumer writes tests that express their expectations of the provider. These expectations become a contract file that the provider checks before making any change. This ensures the provider does not make changes that will break the consumers.

Pact is a popular framework for consumer-driven contract testing.

In a typical API-to-API scenario, a consumer test simulates an HTTP request and its expected JSON response, Pact captures that as a contract, and the provider verifies it before merging any changes.

The same flow applies to events - the consumer test specifies the topic or queue plus the exact payload, Pact saves it as a message contract, and the event producer runs that pact to make sure its events match the consumer's expectations.
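Here's a minimal sketch of what such a consumer-side message pact could look like, assuming pact-js and Jest-style tests. The "OrderService" and "OrderNotifier" names and the OrderPlaced payload are made up for illustration.

```typescript
// A consumer-side message pact sketch (pact-js + Jest assumed).
import path from "path";
import {
  MessageConsumerPact,
  Matchers,
  synchronousBodyHandler,
} from "@pact-foundation/pact";

// The consumer's real handler for OrderPlaced events (simplified here).
function handleOrderPlaced(event: { orderId: string; total: number }): void {
  if (!event.orderId) {
    throw new Error("orderId is required");
  }
}

const messagePact = new MessageConsumerPact({
  consumer: "OrderNotifier",
  provider: "OrderService",
  dir: path.resolve(process.cwd(), "pacts"), // contract files land here
});

describe("OrderPlaced contract", () => {
  it("can handle an OrderPlaced event", () => {
    return messagePact
      .expectsToReceive("an OrderPlaced event")
      .withContent({
        orderId: Matchers.uuid(),
        total: Matchers.decimal(),
      })
      .withMetadata({ "content-type": "application/json" })
      // The handler is exercised against the expected payload; the resulting
      // pact file is what the producer verifies before merging changes.
      .verify(synchronousBodyHandler(handleOrderPlaced));
  });
});
```

The pact file this test produces is uploaded to a Pact broker, and the event producer runs its verification step against it as part of CI.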

Pros

  • Catch breaking schema changes early.
  • Keeps teams loosely coupled.
  • Flexibility - the provider can still make breaking schema changes that don't affect its consumers.

Cons

  • Additional operational overhead for maintaining pact files and the pact broker.
  • Requires comprehensive test coverage or gaps will slip through.
  • Requires buy-in from the entire organization to be truly useful.
  • Still allows breaking changes when a change is not covered by a consumer test. This can be problematic when replaying old events - for example, when a new consumer needs to seed its data from historical events.

2. Integration tests with schema packages

In large organizations, getting widespread buy-in for consumer-driven contract testing is impractical without some sort of "shock event" that focuses minds*.

That's why the teams at LEGO took a different approach [2] to contract testing their event-driven architecture.

The idea is pretty simple.

Event publishers publish their schemas as NPM packages, and event consumers write tests against these packages.
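As a rough sketch, a consumer test might validate its sample payloads and mapping code against the schema shipped in the publisher's package. The package name "@acme/order-events" and its "orderPlacedSchema" export are hypothetical; Ajv and Jest are assumed as the test tooling.

```typescript
// Consumer-side test against a publisher's (hypothetical) schema package.
import Ajv from "ajv";
import { orderPlacedSchema } from "@acme/order-events";

// The consumer's own mapping code under test.
function toOrderRecord(event: { orderId: string; total: number }) {
  return { id: event.orderId, amountInCents: Math.round(event.total * 100) };
}

describe("OrderPlaced schema compatibility", () => {
  const ajv = new Ajv();
  const validate = ajv.compile(orderPlacedSchema);

  it("our sample payload still matches the published schema", () => {
    const sample = {
      orderId: "b7f6d1c2-4a3e-4b1f-9c2d-8e5a6f7b8c9d",
      total: 42.5,
    };

    // If the publisher ships a new package version with a breaking change,
    // this assertion fails when the consumer upgrades and reruns its tests.
    expect(validate(sample)).toBe(true);
    expect(toOrderRecord(sample).amountInCents).toBe(4250);
  });
});
```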

Pros

  • Low barrier to entry - no need to set up a Pact broker or wait for broad org-wide adoption of a new tool.

Cons

  • Doesn't work as well across programming languages.
  • Lacks enforcement power. The publishers do not know about the consumer tests and do not need to exercise them before committing changes. This leaves plenty of room for breaking changes to creep in before consumers notice.

This approach is simple, but not very effective at catching breaking changes early. Instead, it focuses on giving consumers confidence that their code works against the latest event schema from the publisher.

-----

* At a previous employer, we successfully introduced Pact due to several high-profile outages caused by integration issues. They gave us the political energy to align everyone's short-term priorities to push through an organization-wide change. As the saying goes, "Never let a good crisis go to waste".

3. Schema registry and broker-side validation

Both approaches above rely on testing to catch breaking changes during development. However, these are only as effective as their test coverage.

PostNL takes yet another approach [3], which combines the use of a schema registry and an event broker.

The schema registry serves as the single source of truth for event schemas.

When a producer publishes an updated schema, the registry applies compatibility rules to make sure there are no breaking changes - e.g. fields are not removed, and their data types have not changed.

If the change violates these rules, the registry rejects the schema, preventing incompatible versions from ever being registered.

This provides early feedback to event publishers, preventing breaking changes before they go into production.
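As a minimal sketch, here is what that publish-time check could look like with AWS Glue Schema Registry as one possible registry. The registry and schema names are made up, and the JSON Schema definitions are kept deliberately small.

```typescript
// Publish-time compatibility check sketch (AWS SDK v3, Glue Schema Registry).
import {
  GlueClient,
  CreateSchemaCommand,
  RegisterSchemaVersionCommand,
} from "@aws-sdk/client-glue";

const glue = new GlueClient({});

async function main() {
  // One-off: create the schema with BACKWARD compatibility so the registry
  // rejects new versions that existing consumers could not read.
  await glue.send(new CreateSchemaCommand({
    RegistryId: { RegistryName: "orders-registry" },
    SchemaName: "OrderPlaced",
    DataFormat: "JSON",
    Compatibility: "BACKWARD",
    SchemaDefinition: JSON.stringify({
      type: "object",
      required: ["orderId", "total"],
      properties: {
        orderId: { type: "string" },
        total: { type: "number" },
      },
    }),
  }));

  // Later: try to register a version that removes the "total" field.
  // The registry runs its compatibility check and marks the version as
  // FAILURE instead of making it available.
  const result = await glue.send(new RegisterSchemaVersionCommand({
    SchemaId: { RegistryName: "orders-registry", SchemaName: "OrderPlaced" },
    SchemaDefinition: JSON.stringify({
      type: "object",
      required: ["orderId"],
      properties: { orderId: { type: "string" } },
    }),
  }));

  console.log(result.Status); // expect "FAILURE" for an incompatible change
}

main().catch(console.error);
```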

Furthermore, as each event is ingested, the event broker looks up its schema in the registry and verifies that the payload matches the expected structure and data types.

Invalid messages can be rejected, quarantined, or routed to a dead-letter queue, so only valid events reach consumers.
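Conceptually, the broker-side check boils down to something like the sketch below: look up the schema, validate the payload, and divert anything invalid. This assumes a validation Lambda sitting between ingestion and delivery; the dead-letter queue URL and registry/schema names are made up for illustration.

```typescript
// Runtime validation sketch: validate incoming events against the registered
// schema and quarantine invalid ones in a dead-letter queue.
import Ajv from "ajv";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";
import { GlueClient, GetSchemaVersionCommand } from "@aws-sdk/client-glue";

const ajv = new Ajv();
const sqs = new SQSClient({});
const glue = new GlueClient({});
const DLQ_URL = process.env.DLQ_URL!; // hypothetical dead-letter queue

export async function validateAndForward(event: {
  detailType: string;
  detail: unknown;
}) {
  // Look up the latest registered schema for this event type.
  const { SchemaDefinition } = await glue.send(new GetSchemaVersionCommand({
    SchemaId: { RegistryName: "orders-registry", SchemaName: event.detailType },
    SchemaVersionNumber: { LatestVersion: true },
  }));

  const validate = ajv.compile(JSON.parse(SchemaDefinition!));

  if (!validate(event.detail)) {
    // Quarantine invalid payloads so only valid events reach consumers.
    await sqs.send(new SendMessageCommand({
      QueueUrl: DLQ_URL,
      MessageBody: JSON.stringify({ event, errors: validate.errors }),
    }));
    return { forwarded: false };
  }

  return { forwarded: true };
}
```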

Pros

  • Centralized governance - a single source of truth for all event schemas, simplifying management and versioning.
  • The combination of publish-time and runtime validation provides complete protection against breaking changes.
  • Simplifies things for both event publishers and consumers, as validation is handled by these central resources.

Cons

  • Running and scaling the schema registry and broker adds operational complexity and cost.
  • Compatibility rules cover structure and data types, but can't enforce business-specific expectations. For example, which events should be fired, and when?
  • Schema lookups and validation add latency overhead.
  • Can be overly restrictive. For example, breaking changes are not allowed even when there are no consumers.

Summary

Consumer-driven contracts enforce each consumer's exact event expectations at development time. They're ideal when you need to validate business-specific rules and your teams are bought into writing contract tests and maintaining high test coverage.

Alternatively, publishers can share event schemas as code libraries for consumers to test against. This is simple to set up and does not require organization-wide buy-in. However, it's not effective at preventing breaking changes.

Using a schema registry and broker-side validation gives you centralised governance and provides complete protection against breaking changes. However, it adds operational overhead and can't enforce business‑specific expectations.

You can also combine these techniques.

For example, use registry + broker validation to block accidental breaking changes and add consumer-driven contract tests to verify that specific business events fire exactly when and how you expect.

Links

[1] Event versioning strategies for event-driven architectures

[2] How LEGO approaches contract testing

[3] (Podcast) Event-driven architecture at PostNL with Luc van Donkersgoed
