MapReduce

Slides Youtube

MapReduce

  • MapReduce is a programming model and software system developed by Google .

  • Characters: client-server architecture, message-passing communication, and bulk synchronous parallel.

  • Apache Hadoop is an open-source implementation of MapReduce.

  • Apache Spark is an improved open-source MapReduce.

Broadcast

mapreduce-1

Map

mapreduce-2

Reduce

mapreduce-3

Data Parallelism

Partition the data among worker nodes. (A node has a subset of data.)

data_parallelism_1

Parallel Gradient Descent Using MapReduce

  • Broadcast: Server broadcast the up-to-date parameters wtw_t o workers.

  • Map: Workers do computation locally.

    • Map (xi,yi,wt)(x_i,y_i,w_t) to gi=(xiTwt−yi)xig_i=(x_i^T w_t-yi)x_i.

    • Obtain nn vectors: g1,g2,g3,...,gng_1, g_2,g_3,...,g_n

  • Reduce: Compute the sum: g=∑i=1ngig=\sum_{i=1}^{n}g_i

  • Every worker sums all the gi{g_i} stored in its local memory to get a vector.

  • Then, the server sums the resulting m vectors. (There are m workers.)

  • Server updates the parameters: wt+1=tt−α⋅gw_{t+1}=t_t-\alpha \cdot g

Parallel_Gradient_Descent_Using_MapReduce_1

Speedup Ratio

speedup_ratio_1

Communication Cost

  • Communication complexity: How many words are transmitted between server and workers.

    • Proportional to number of parameters.

    • Grow with number of worker nodes.

  • Latency: How much time it takes for a packet of data to get from one point to another. (Determined by the compute network.)

  • Communication time: comlexitybancwith+latency\frac{comlexity}{bancwith}+latency

Bulk Synchronous

Bulk_Synchronous_1

Synchronization Cost

Question: What if a node fails and then restart?

  • This node will be much slower than all the others.

  • It is called straggler.

  • Straggler effect:

    • The wall-clock time is determined by the slowest node.

    • It is a consequence of synchronization.

Footnote

Last updated

Was this helpful?