Do you know your Fan-Out/Fan-In from Map-Reduce?


Many students and clients have asked me how to implement Map-Reduce workloads serverlessly. In most cases, they are actually asking about Fan-Out/Fan-In!

At a glance, the two patterns look very similar and they are often used interchangeably in conversations. So in this post, let's compare them and see how they differ.

Why? Because names matter ;-)

Fan-Out/Fan-In

Fan-Out and Fan-In are two patterns that are often used together to divide and conquer a large task by:

  1. Divide the task into smaller subtasks (Fan-Out);
  2. Process each subtask in parallel;
  3. Collect the results of the subtasks into a single result (Fan-in).

You can also use the Fan-Out pattern without the Fan-In step. For example, if you don't need to capture and return the results of processing these subtasks.
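The three steps above can be sketched in a few lines of Python using a thread pool. This is just a minimal local illustration of the pattern, with squaring numbers standing in for real subtask work:

```python
from concurrent.futures import ThreadPoolExecutor

def process(subtask: int) -> int:
    # Hypothetical work: square the number
    return subtask * subtask

subtasks = list(range(10))                       # 1. Fan-Out: divide into subtasks
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process, subtasks))  # 2. Process each subtask in parallel
total = sum(results)                             # 3. Fan-In: collect into a single result
print(total)  # 285
```

Dropping the final `sum` line gives you the Fan-Out-only variant: the subtasks still run in parallel, but no single result is collected.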

Map-Reduce

Whereas Fan-Out/Fan-In is a general pattern for parallel processing, Map-Reduce has a specific structure that involves "map" and "reduce" steps. Typically, a Map-Reduce framework (such as Hadoop) has the following steps:

  • Map: The input data is divided into chunks, and a map function processes each chunk to produce an output;
  • Shuffle: The intermediate key-value pairs are shuffled and sorted to group all values associated with the same key;
  • Reduce: A reduce function processes each group of intermediate values to produce the final output.
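To make the three steps concrete, here is a minimal in-memory sketch of the classic word-count example. A real framework like Hadoop distributes each phase across worker nodes; this just shows the shape of the data at each step:

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit (word, 1) key-value pairs for each chunk of input
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Shuffle: group all values associated with the same key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each group of intermediate values
    return {key: sum(values) for key, values in groups.items()}

chunks = ["the quick fox", "the lazy dog"]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(pairs))
print(counts["the"])  # 2
```

Note that the shuffle step is what makes this more than plain Fan-Out/Fan-In: results from different chunks that share a key ("the") are brought together before the reduce step.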

Map-Reduce is typically used to process large amounts of data across a fleet of worker nodes. Data locality is an important performance consideration (to avoid making costly network calls).

Fan-Out/Fan-In vs. Map-Reduce

You can think of Map-Reduce as a specific flavour of Fan-Out/Fan-In. It has a particular structure that involves "map" and "reduce" steps.

They differ in some subtle and important ways.

Use cases

Fan-Out/Fan-In is commonly used in scenarios where independent tasks can be processed in parallel. There's no need to aggregate intermediate results and group them. For example, web scraping and making concurrent API calls to 3rd party services.

Map-Reduce is typically used for large-scale data processing such as indexing, log analysis, and data transformations. It's well suited for situations where you need to group and aggregate results by an intermediate key. For example, if you need to query TBs of user click-stream data and calculate the percentage of website visitors who have clicked a link.
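The click-stream example maps nicely onto the three Map-Reduce steps, with the visitor ID as the intermediate key. Here's a toy sketch with a handful of made-up events standing in for the TBs of real data:

```python
from collections import defaultdict

# Hypothetical click-stream events (in reality, TBs of data split into chunks)
events = [
    {"visitor": "a", "event": "view"},
    {"visitor": "a", "event": "click"},
    {"visitor": "b", "event": "view"},
    {"visitor": "c", "event": "view"},
    {"visitor": "c", "event": "click"},
]

# Map: emit (visitor, clicked?) pairs
pairs = [(e["visitor"], e["event"] == "click") for e in events]

# Shuffle: group the flags by visitor (the intermediate key)
groups = defaultdict(list)
for visitor, clicked in pairs:
    groups[visitor].append(clicked)

# Reduce: a visitor counts if they clicked at least once
clicked_visitors = sum(any(flags) for flags in groups.values())
pct = 100 * clicked_visitors / len(groups)
print(pct)  # 2 of 3 visitors clicked
```

The grouping-by-visitor step is exactly what plain Fan-Out/Fan-In lacks, which is why this workload is a natural fit for Map-Reduce.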

Data locality

With Fan-Out/Fan-In, data locality is typically not a major concern, because each task is processed independently and the results are only aggregated at the end.

With Map-Reduce, data locality is crucial for minimizing data transfers across the network. That's why it has a "shuffle" step to group related intermediate results so they can be processed together.

Serverless implementation

There are many ways to implement Fan-Out/Fan-In using serverless services such as Lambda and Step Functions.
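One common approach is a Step Functions Map state: it fans out over an array of subtasks, runs a worker Lambda function for each, and collects the results into an array for a final aggregation step. Here's a simplified state machine sketch (the `process-subtask` and `aggregate-results` function names are placeholders):

```json
{
  "Comment": "Fan-Out/Fan-In with a Step Functions Map state (sketch)",
  "StartAt": "FanOut",
  "States": {
    "FanOut": {
      "Type": "Map",
      "ItemsPath": "$.subtasks",
      "MaxConcurrency": 10,
      "ItemProcessor": {
        "ProcessorConfig": { "Mode": "INLINE" },
        "StartAt": "ProcessSubtask",
        "States": {
          "ProcessSubtask": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Parameters": {
              "FunctionName": "process-subtask",
              "Payload.$": "$"
            },
            "End": true
          }
        }
      },
      "Next": "FanIn"
    },
    "FanIn": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "aggregate-results",
        "Payload.$": "$"
      },
      "End": true
    }
  }
}
```

The Map state's output (the array of per-subtask results) flows into the FanIn task, which performs the aggregation.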

In fact, I wrote about several ways to do exactly this back in 2018. A lot has changed since then, so in the next post, I will share my thoughts on the best ways to implement Fan-Out/Fan-In serverlessly in 2024.

In the context of serverless, if you're using Lambda, you're probably not using Hadoop or Spark (there's no clear integration path). And when you use services such as Lambda or Step Functions, you also don't have access to the underlying worker nodes.

However, it's possible to perform Map-Reduce jobs with other serverless services on AWS. For example, AWS Glue offers a serverless ETL service that lets you run Spark jobs. Amazon EMR also gives you a managed Hadoop environment.

Summary

Both Fan-Out/Fan-In and Map-Reduce patterns execute tasks in parallel, but they serve different purposes and are suited to different types of workloads.

Fan-Out/Fan-In is a more general pattern applicable to a wide range of parallel processing tasks, whereas Map-Reduce is a specific pattern that's designed for big data processing.

I hope you've found this post useful and that it answers a question you've always wondered about! ;-)
