MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel and distributed algorithm on a cluster.
A MapReduce framework is usually composed of three operations:
- Map: each worker node applies the map function to the local data, and writes the output to a temporary storage. A master node ensures that only one copy of the redundant input data is processed.
- Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node.
- Reduce: worker nodes now process each group of output data, per key, in parallel.
Another way to look at MapReduce is as a 5-step parallel and distributed computation:
- Prepare the Map() input – the "MapReduce system" designates Map processors, assigns the input key K1 that each processor would work on, and provides that processor with all the input data associated with that key.
- Run the user-provided Map() code – Map() is run exactly once for each K1 key, generating output organized by key K2.
- "Shuffle" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key each processor should work on, and provides that processor with all the Map-generated data associated with that key.
- Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key produced by the Map step.
- Produce the final output – the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome.
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
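For instance, word count fits this typing exactly. A minimal stand-alone Go sketch (the KeyValue shape mirrors the one used in the lab later in this post; the shuffle step is done by hand in main for the demo):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// KeyValue is the intermediate pair type (string key and string value).
type KeyValue struct {
	Key   string
	Value string
}

// Map: (k1 = filename, v1 = contents) -> list of (k2 = word, v2 = "1").
func Map(filename string, contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	kvs := make([]KeyValue, 0, len(words))
	for _, w := range words {
		kvs = append(kvs, KeyValue{Key: w, Value: "1"})
	}
	return kvs
}

// Reduce: (k2 = word, list of v2) -> (k3 = word, v3 = count).
func Reduce(key string, values []string) string {
	return strconv.Itoa(len(values))
}

func main() {
	// Group by k2 (the shuffle), then reduce each group.
	groups := map[string][]string{}
	for _, kv := range Map("doc.txt", "the quick fox jumps over the lazy dog") {
		groups[kv.Key] = append(groups[kv.Key], kv.Value)
	}
	for word, ones := range groups {
		fmt.Println(word, Reduce(word, ones))
	}
}
```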
Dataflow:
- an input reader
- a Map function
- a partition function
- a compare function
- a Reduce function
- an output writer
Logical view:
The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain:

Map(k1, v1) → list((k2, v2))

The Map function is applied in parallel to every pair (keyed by k1) in the input dataset. This produces a list of pairs (keyed by k2) for each call. After that, the MapReduce framework collects all pairs with the same key (k2) from all lists and groups them together, creating one group for each key.
The Reduce function is then applied in parallel to each group, which in turn produces a collection of values in the same domain:

Reduce(k2, list(v2)) → list((k3, v3))
Each Reduce call typically produces either one key-value pair or an empty return, though one call is allowed to return more than one key-value pair. The returns of all calls are collected as the desired result list.
Thus the MapReduce framework transforms a list of (key, value) pairs into another list of (key, value) pairs. This behavior is different from the typical functional programming map and reduce combination, which accepts a list of arbitrary values and returns one single value that combines all the values returned by map.
It is necessary but not sufficient to have implementations of the map and reduce abstractions in order to implement MapReduce. Distributed implementations of MapReduce require a means of connecting the processes performing the Map and Reduce phases. This may be a distributed file system. Other options are possible, such as direct streaming from mappers to reducers, or for the mapping processors to serve up their results to reducers that query them.
Example:
Imagine that for a database of 1.1 billion people, one would like to compute the average number of social contacts a person has according to age. In SQL, such a query could be expressed as:
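One plausible form of that query (the schema and column names are illustrative):

```sql
SELECT age, AVG(contacts)
FROM social.person
GROUP BY age
ORDER BY age
```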
Using MapReduce, the K1 key values could be the integers 1 through 1100, each representing a batch of 1 million records, the K2 key value could be a person's age in years, and this computation could be achieved using the following functions:
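A sketch of those two functions in Go, under assumed record types (Person and the (N, C) pair are my own names; loading the K1 batch of one million records is elided):

```go
package contacts

// Person is an illustrative record type.
type Person struct {
	Age         int // age in years; becomes the K2 key
	NumContacts int
}

// NC is the (N, C) value pair from the text: N contacts in total
// across C people.
type NC struct{ N, C int }

// Map emits (age, (contacts, 1)) for every person in the K1 batch;
// C = 1 because each output pair describes a single person.
func Map(batch []Person, emit func(age int, v NC)) {
	for _, p := range batch {
		emit(p.Age, NC{N: p.NumContacts, C: 1})
	}
}

// Reduce combines one age group's (N, C) pairs into the average number
// of contacts and the total head count for that age.
func Reduce(age int, values []NC, emit func(age int, avg float64, people int)) {
	s, c := 0, 0
	for _, v := range values {
		s += v.N * v.C // weighted sum of contacts
		c += v.C       // total people in this age group
	}
	emit(age, float64(s)/float64(c), c)
}
```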
Note that in the Reduce function, C is the count of people having in total N contacts, so in the Map function it is natural to write C = 1, since every output pair refers to the contacts of one single person.
LAB:
In this lab you'll build a MapReduce system. You'll implement a worker process that calls application Map and Reduce functions and handles reading and writing files, and a coordinator process that hands out tasks to workers and copes with failed workers. You'll be building something similar to the system described in the MapReduce paper. (Note: this lab uses "coordinator" instead of the paper's "master".)
Provided functions:
coordinator.go
Done() → bool // reports whether the whole job has finished
MakeCoordinator(files []string, nReduce int) → *Coordinator
worker.go
Worker(mapf func(string, string) []KeyValue, reducef func(string, []string) string)
Points to think through (see the sketch after this list for one possible set of definitions):
- How do we turn the input files into key/value pairs?
- How do we transform those key/value pairs a second time (the reduce step)?
- How do we decide that all tasks of a phase are finished?
- How should the structs be defined?
- How should the task queue be defined?
- How should the intermediate output be defined?
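One possible set of definitions, as a sketch only (all names below are my own assumptions, written as if inside the lab's mr package):

```go
package mr

import "sync"

// TaskType distinguishes the two kinds of work.
type TaskType int

const (
	MapTask TaskType = iota
	ReduceTask
)

// Task is one unit of work handed to a worker.
type Task struct {
	Type     TaskType
	ID       int
	FileName string // input split; used only by map tasks
}

// Coordinator holds the job state behind a mutex.
type Coordinator struct {
	mu          sync.Mutex
	nMap        int       // number of map tasks (== number of input files)
	nReduce     int       // number of reduce partitions
	mapTasks    chan Task // task queues: buffered channels hand out work
	reduceTasks chan Task
	mapDone     int // completed map tasks; reduce starts when == nMap
	reduceDone  int
}

// Done reports whether the whole job has finished.
func (c *Coordinator) Done() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.reduceDone == c.nReduce
}
```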
Because every Map Task partitions its output with the same hash function ihash, pairs with the same key are always written to intermediate files with the same partition suffix. For example, in the wc job, suppose the input files of Map Task 1 and Map Task 2 both contain the word hello. All of Map Task 1's hello pairs are written to the intermediate file mr-out-1-5, where 1 is the map task number and 5 is the key's hash value modulo the number of reduce tasks; all of Map Task 2's hello pairs therefore go to mr-out-2-5. Reduce Task 5 then reads every file matching mr-out-*-5 (see the sketch after the questions below).
- How does a worker register itself with the coordinator?
- How does the coordinator manage tasks?
- How is parallelism guaranteed?
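The partitioning itself is tiny. ihash below is the helper shipped in the lab's worker.go; partitionFile is a hypothetical wrapper showing the naming scheme described above:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ihash is the hash helper from the lab skeleton; because every map task
// uses the same function, a given key always lands in the same partition.
func ihash(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() & 0x7fffffff)
}

// partitionFile picks the intermediate file for one map output pair,
// following the mr-out-<mapID>-<partition> naming used above.
func partitionFile(mapID int, key string, nReduce int) string {
	return fmt.Sprintf("mr-out-%d-%d", mapID, ihash(key)%nReduce)
}

func main() {
	// "hello" from map task 1 and map task 2 goes to files that share
	// the same partition suffix, so one reduce task sees both.
	fmt.Println(partitionFile(1, "hello", 10))
	fmt.Println(partitionFile(2, "hello", 10))
}
```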
worker.go
Below is my understanding of the worker:
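In short, the worker loops: ask the coordinator for a task, run mapf or reducef on it, report the outcome, and repeat until told to exit. A sketch under those assumptions (the MessReply fields and MsgType constants are illustrative and defined just below; doMapTask/doReduceTask stand in for the real file handling; assumes "time" is imported):

```go
func Worker(mapf func(string, string) []KeyValue,
	reducef func(string, []string) string) {
	for {
		reply := CallForTask() // ask the coordinator for work
		if reply == nil {
			return // coordinator unreachable: assume the job is over
		}
		switch reply.MsgType {
		case MapTaskAlloc:
			doMapTask(mapf, reply) // read the input split, write mr-out-X-Y files
			CallForReportStatus(MapSuccess, reply.TaskID)
		case ReduceTaskAlloc:
			doReduceTask(reducef, reply) // read mr-out-*-Y, write the final output
			CallForReportStatus(ReduceSuccess, reply.TaskID)
		case Wait:
			time.Sleep(100 * time.Millisecond) // all tasks handed out; retry soon
		case Shutdown:
			return
		}
	}
}
```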

So the struct the worker sends to the coordinator can be defined like this:
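(A sketch; the field names are my own assumptions. The reply struct is included as well, since CallForTask below returns it.)

```go
// MessageSend is the RPC argument a worker sends to the coordinator.
type MessageSend struct {
	MsgType MsgType // AskForTask, MapSuccess, ReduceSuccess, ...
	TaskID  int     // which task is being reported (ignored for AskForTask)
}

// MessReply is the coordinator's answer, named after CallForTask's
// return type below.
type MessReply struct {
	MsgType  MsgType // MapTaskAlloc, ReduceTaskAlloc, Wait, or Shutdown
	TaskID   int     // task number (also the X in mr-out-X-Y)
	NReduce  int     // number of reduce partitions
	FileName string  // input split for a map task
}
```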
The message types are as follows:
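(A plausible enumeration covering both directions; the exact names are my guesses based on the two calls described below.)

```go
// MsgType enumerates the message kinds exchanged over RPC.
type MsgType int

const (
	AskForTask      MsgType = iota // worker -> coordinator: request work
	MapSuccess                     // worker -> coordinator: map task finished
	ReduceSuccess                  // worker -> coordinator: reduce task finished
	MapTaskAlloc                   // coordinator -> worker: run this map task
	ReduceTaskAlloc                // coordinator -> worker: run this reduce task
	Wait                           // coordinator -> worker: nothing free yet, retry
	Shutdown                       // coordinator -> worker: job done, exit
)
```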
The worker communicates with the coordinator through two functions:
- Request a task:
CallForTask() -> *MessReply
- Report task completion status:
CallForReportStatus(succesType MsgType, taskID int) -> error
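Sketches of the two wrappers, assuming the call() RPC helper from the lab skeleton (mr/worker.go), the MessageSend/MessReply types above, and hypothetical coordinator method names ("Coordinator.AskForTask", "Coordinator.NoticeResult"); assumes "fmt" is imported:

```go
func CallForTask() *MessReply {
	args := MessageSend{MsgType: AskForTask}
	reply := MessReply{}
	if ok := call("Coordinator.AskForTask", &args, &reply); !ok {
		return nil // RPC failed: treat the coordinator as gone
	}
	return &reply
}

func CallForReportStatus(succesType MsgType, taskID int) error {
	args := MessageSend{MsgType: succesType, TaskID: taskID}
	reply := MessReply{}
	if ok := call("Coordinator.NoticeResult", &args, &reply); !ok {
		return fmt.Errorf("report status RPC failed for task %d", taskID)
	}
	return nil
}
```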
- Author: GJJ
- Link: https://blog.gaojj.cn/article/blog-108
- License: This article is licensed under CC BY-NC-SA 4.0; please credit the source when reposting.