MapReduce

Slides Youtube

MapReduce

  • MapReduce is a programming model and software system developed by Google .

  • Characters: client-server architecture, message-passing communication, and bulk synchronous parallel.

  • Apache Hadoop is an open-source implementation of MapReduce.

  • Apache Spark is an improved open-source MapReduce.

Broadcast

Map

Reduce

Data Parallelism

Partition the data among worker nodes. (A node has a subset of data.)

Parallel Gradient Descent Using MapReduce

  • Map: Workers do computation locally.

  • Then, the server sums the resulting m vectors. (There are m workers.)

Speedup Ratio

Communication Cost

  • Communication complexity: How many words are transmitted between server and workers.

    • Proportional to number of parameters.

    • Grow with number of worker nodes.

  • Latency: How much time it takes for a packet of data to get from one point to another. (Determined by the compute network.)

Bulk Synchronous

Synchronization Cost

Question: What if a node fails and then restart?

  • This node will be much slower than all the others.

  • It is called straggler.

  • Straggler effect:

    • The wall-clock time is determined by the slowest node.

    • It is a consequence of synchronization.

Footnote

Last updated