Do you know your Fan-Out/Fan-In from Map-Reduce?


Many students and clients have asked me how to implement Map-Reduce workloads serverlessly. In most cases, they are actually asking about Fan-Out/Fan-In!

At a glance, the two patterns look very similar and they are often used interchangeably in conversations. So in this post, let's compare them and see how they differ.

Why? Because names matter ;-)

Fan-Out/Fan-In

Fan-Out and Fan-In are two patterns that are often used together to divide and conquer a large task by:

  1. Dividing the task into smaller subtasks (Fan-Out);
  2. Processing each subtask in parallel;
  3. Collecting the results of the subtasks into a single result (Fan-In).

You can also use the Fan-Out pattern without the Fan-In step. For example, when you don't need to capture and return the results of processing these subtasks.
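Here's a minimal local sketch of the three steps above, using a thread pool to stand in for parallel workers (in a serverless setup, each subtask would instead be a separate Lambda invocation). The `process` function and the chunk size are placeholders for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def process(subtask):
    # Placeholder work: sum the squares of the numbers in this subtask.
    return sum(n * n for n in subtask)

task = list(range(10))

# 1. Fan-Out: divide the task into smaller subtasks.
subtasks = [task[i:i + 2] for i in range(0, len(task), 2)]

# 2. Process each subtask in parallel.
with ThreadPoolExecutor() as executor:
    results = list(executor.map(process, subtasks))

# 3. Fan-In: collect the results of the subtasks into a single result.
total = sum(results)
print(total)  # sum of squares of 0..9 = 285
```

Dropping step 3 (and just firing off the subtasks) gives you the Fan-Out-only variant mentioned above.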

Map-Reduce

Whereas Fan-Out/Fan-In is a general pattern for parallel processing, Map-Reduce has a specific structure that involves "map" and "reduce" steps. Typically, a Map-Reduce framework (such as Hadoop) has the following steps:

  • Map: The input data is divided into chunks, and a map function processes each chunk to produce an output;
  • Shuffle: The intermediate key-value pairs are shuffled and sorted to group all values associated with the same key;
  • Reduce: A reduce function processes each group of intermediate values to produce the final output.

Map-Reduce is typically used to process large amounts of data across a fleet of worker nodes. Data locality is an important performance consideration (to avoid making costly network calls).
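To make the map/shuffle/reduce steps concrete, here's a toy word count, the classic Map-Reduce example, sketched in plain Python. The input documents are made up for illustration; a real framework would distribute each step across worker nodes.

```python
from collections import defaultdict

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: process each chunk of input and emit intermediate (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values associated with the same key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: process each group of intermediate values to produce the final output.
counts = {word: sum(values) for word, values in groups.items()}
print(counts["the"])  # 3
print(counts["fox"])  # 2
```

Notice that the shuffle step is what distinguishes this from plain Fan-Out/Fan-In: intermediate results are regrouped by key before the final aggregation.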

Fan-Out/Fan-In vs. Map-Reduce

You can think of Map-Reduce as a specific flavour of Fan-Out/Fan-In. It has a particular structure that involves "map" and "reduce" steps.

They differ in some subtle and important ways.

Use cases

Fan-Out/Fan-In is commonly used in scenarios where independent tasks can be processed in parallel. There's no need to aggregate intermediate results and group them. For example, web scraping and making concurrent API calls to 3rd party services.

Map-Reduce is typically used for large-scale data processing such as indexing, log analysis, and data transformations. It's well suited for situations when you need to group and aggregate results by an intermediate key. For example, if you need to query TBs of user click-stream data and calculate the percentage of website visitors who have clicked a link.
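The click-stream example can be sketched like this, grouping by visitor ID as the intermediate key. The event records and field names here are made up for illustration; at TB scale, each step would run across many worker nodes.

```python
# Hypothetical click-stream events: (visitor_id, event_type) pairs.
events = [
    ("alice", "page_view"),
    ("alice", "link_click"),
    ("bob", "page_view"),
    ("carol", "page_view"),
    ("carol", "link_click"),
]

# Map: emit (visitor_id, did_they_click) pairs.
mapped = [(visitor, event == "link_click") for visitor, event in events]

# Shuffle + Reduce: per visitor, did they click at least once?
clicked_by_visitor = {}
for visitor, clicked in mapped:
    clicked_by_visitor[visitor] = clicked_by_visitor.get(visitor, False) or clicked

# Final aggregation: percentage of visitors who clicked a link.
visitors = len(clicked_by_visitor)
clickers = sum(clicked_by_visitor.values())
pct = 100 * clickers / visitors
print(round(pct, 1))  # 66.7
```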

Data locality

With Fan-Out/Fan-In, data locality is typically not a major concern, because each task is processed independently and the results are only aggregated at the end.

With Map-Reduce, data locality is crucial for minimizing data transfers across the network. That's why it has a "shuffle" step to group related intermediate results so they can be processed together.

Serverless implementation

There are many ways to implement Fan-Out/Fan-In using serverless services such as Lambda and Step Functions.

In fact, I wrote about several ways to do exactly this back in 2018. A lot has changed since then, so in the next post, I will share my thoughts on the best ways to implement Fan-Out/Fan-In serverlessly in 2024.

In the context of serverless, if you're using Lambda, you're probably not using Hadoop or Spark (there's no clear integration path). And when you use services such as Lambda or Step Functions, you also don't have access to the underlying worker nodes.

However, it's possible to perform Map-Reduce jobs with other serverless services on AWS. For example, AWS Glue offers a serverless ETL service that lets you run Spark jobs. Amazon EMR also gives you a managed Hadoop environment.

Summary

Both Fan-Out/Fan-In and Map-Reduce patterns execute tasks in parallel, but they serve different purposes and are suited to different types of workloads.

Fan-Out/Fan-In is a more general pattern applicable to a wide range of parallel processing tasks, whereas Map-Reduce is a specific pattern that's designed for big data processing.

I hope you've found this post useful and that it answers a question you've always wondered about! ;-)
