MapReduce
Last updated
Last updated
MapReduce is a programming model and software system developed by Google .
Characters: client-server architecture, message-passing communication, and bulk synchronous parallel.
Apache Hadoop is an open-source implementation of MapReduce.
Apache Spark is an improved open-source MapReduce.
Partition the data among worker nodes. (A node has a subset of data.)
Broadcast: Server broadcast the up-to-date parameters o workers.
Map: Workers do computation locally.
Map to .
Obtain vectors:
Reduce: Compute the sum:
Every worker sums all the stored in its local memory to get a vector.
Then, the server sums the resulting m vectors. (There are m workers.)
Server updates the parameters:
Communication complexity: How many words are transmitted between server and workers.
Proportional to number of parameters.
Grow with number of worker nodes.
Latency: How much time it takes for a packet of data to get from one point to another. (Determined by the compute network.)
Communication time:
Question: What if a node fails and then restart?
This node will be much slower than all the others.
It is called straggler.
Straggler effect:
The wall-clock time is determined by the slowest node.
It is a consequence of synchronization.