Decentralized and Ring All-Reduce
Characteristics: peer-to-peer architecture (no central server), message-passing communication; each node communicates only with its neighbors.
Decentralized GD and SGD are guaranteed to converge under standard assumptions.
The convergence rate depends on how well the nodes are connected.
If the nodes are well connected, convergence is fast.
If the graph is not strongly connected, the algorithm does not converge.
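To make the connectivity point concrete, here is a minimal NumPy sketch of decentralized gradient descent. The ring topology, the doubly stochastic mixing matrix W, and the local quadratic objectives are assumptions made for the demo, not the notes' exact setup.

```python
import numpy as np

m, d = 4, 3                        # 4 nodes, 3 parameters each
rng = np.random.default_rng(0)

# Ring topology: each node mixes with itself and its two neighbors.
# W is doubly stochastic (a standard assumption for convergence).
W = np.zeros((m, m))
for i in range(m):
    for j in (i - 1, i, i + 1):
        W[i, j % m] = 1.0 / 3.0

# Local objectives f_i(x) = 0.5 * ||A_i x - b_i||^2, made up for the demo.
A = rng.standard_normal((m, 5, d))
b = rng.standard_normal((m, 5))
X = np.zeros((m, d))               # row i holds node i's parameters

for step in range(500):
    grads = np.stack([A[i].T @ (A[i] @ X[i] - b[i]) for i in range(m)])
    X = W @ X - 0.02 * grads       # gossip with neighbors + local gradient step

# With a connected graph, the rows of X reach consensus near the minimizer
# of the average objective; the spread across nodes shrinks toward zero.
print(np.ptp(X, axis=0))
```

The better connected the graph (the larger W's spectral gap), the faster the rows of X agree; on a disconnected graph the components never exchange information and consensus is impossible.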
Reduce: the server gets the result of the reduce operation (e.g., sum, mean, count).
All-reduce: every node gets a copy of the result of the reduce. (A minimal sketch contrasting the two appears after this list.)
E.g., all-reduce via reduce+broadcast. See figure 1
E.g., all-reduce via all-to-all communication. See figure 2
E.g., ring all-reduce.
Drawback of reduce+broadcast and all-to-all communication: at any given time, most of the network is idle.
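The reduce vs. all-reduce distinction can be seen directly with MPI. The sketch below uses mpi4py, which is an assumed library choice (the notes only mention MPI later, via Horovod); `Reduce` and `Allreduce` are mpi4py's real buffer-based API.

```python
# Run with, e.g.:  mpirun -n 4 python reduce_vs_allreduce.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local = np.array([float(rank + 1)])   # each node's local value
reduced = np.zeros(1)

# Reduce: only the root (the "server") receives the sum; others keep zeros.
comm.Reduce(local, reduced, op=MPI.SUM, root=0)

# All-reduce: every node receives a copy of the sum.
all_reduced = np.zeros(1)
comm.Allreduce(local, all_reduced, op=MPI.SUM)

print(f"rank {rank}: reduce={reduced[0]}, allreduce={all_reduced[0]}")
```

Internally, an MPI implementation may realize `Allreduce` as reduce+broadcast, all-to-all exchange, or a ring, which correspond to the three examples above.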
m: number of GPUs.
d: number of parameters.
b: network bandwidth.
Then continue updating all GPUs' gradients with these fully reduced (red) blocks.
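As a sanity check on the blocked scheme, here is a single-process NumPy simulation of ring all-reduce: a scatter-reduce phase that produces the fully reduced blocks, then an all-gather phase that circulates them. The chunk-indexing schedule is the standard one and is assumed, since the referenced figure is not reproduced here.

```python
import numpy as np

m, d = 4, 8                                  # GPUs, parameters (m divides d)
rng = np.random.default_rng(1)
grads = rng.standard_normal((m, d))          # row i = GPU i's local gradient
blocks = grads.reshape(m, m, d // m).copy()  # blocks[i, k] = GPU i's k-th block

# Phase 1: scatter-reduce. At step t, GPU i sends block (i - t) mod m to
# GPU (i + 1) mod m, which adds it to its own copy. After m - 1 steps,
# GPU i holds the fully reduced block (i + 1) mod m (the "red" blocks).
for t in range(m - 1):
    sent = [blocks[i, (i - t) % m].copy() for i in range(m)]
    for i in range(m):
        blocks[(i + 1) % m, (i - t) % m] += sent[i]

# Phase 2: all-gather. The fully reduced blocks travel around the ring,
# overwriting stale copies, until every GPU holds every reduced block.
for t in range(m - 1):
    sent = [blocks[i, (i + 1 - t) % m].copy() for i in range(m)]
    for i in range(m):
        blocks[(i + 1) % m, (i + 1 - t) % m] = sent[i]

result = blocks.reshape(m, d)
assert np.allclose(result, grads.sum(axis=0))  # every GPU has the full sum
```

At every step all m links of the ring carry data simultaneously, which is why ring all-reduce uses the available bandwidth far better than reduce+broadcast.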
In addition to being network-optimal, the allreduce approach is much easier to understand and adopt. Users utilize a Message Passing Interface (MPI) implementation such as Open MPI to launch all copies of the TensorFlow program. MPI then transparently sets up the distributed infrastructure necessary for workers to communicate with each other. All the user needs to do is modify their program to average gradients using an allreduce() operation.
We replaced the Baidu ring-allreduce implementation with NCCL. NCCL is NVIDIA’s library for collective communication that provides a highly optimized version of ring-allreduce. NCCL 2 introduced the ability to run ring-allreduce across multiple machines, enabling us to take advantage of its many performance boosting optimizations.
We added support for models that fit inside a single server, potentially on multiple GPUs, whereas the original version only supported models that fit on a single GPU.
Finally, we made several API improvements inspired by feedback we received from a number of initial users. In particular, we implemented a broadcast operation that enforces consistent initialization of the model on all workers. The new API allowed us to cut down the number of operations a user had to introduce to their single GPU program to four.
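For concreteness, the four operations look roughly like the classic Horovod TensorFlow example sketched below. The Horovod calls (`hvd.init`, `hvd.local_rank`, `hvd.DistributedOptimizer`, `hvd.BroadcastGlobalVariablesHook`) are Horovod's documented API; the toy model itself is invented for illustration.

```python
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                         # (1) initialize Horovod

config = tf.ConfigProto()                          # (2) pin one GPU per process
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy single-GPU model (illustrative placeholder).
x = tf.random_normal([32, 10])
w = tf.get_variable("w", [10, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

opt = tf.train.GradientDescentOptimizer(0.01)
opt = hvd.DistributedOptimizer(opt)                # (3) average gradients via ring all-reduce
train_op = opt.minimize(loss)

hooks = [hvd.BroadcastGlobalVariablesHook(0)]      # (4) consistent initialization from rank 0

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    for _ in range(100):
        sess.run(train_op)
```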
A comparison of the images processed per second with Horovod over plain 25GbE TCP and with Horovod over 25GbE RDMA-capable networking, when running a distributed training job over different numbers of NVIDIA Pascal GPUs for Inception V3, ResNet-101, and VGG-16.
Since both MPI and NCCL support remote direct memory access (RDMA) capable networking (e.g., via InfiniBand or RDMA over Converged Ethernet), we ran additional sets of benchmarking tests using RDMA network cards to determine if they helped us enhance efficiency compared to TCP networking.
Note: the accumulated sum (e.g., g1 + g2) is treated as a single gradient and is transferred only once; the summands g1 and g2 are not transferred separately.
Then continue to update all GPUs with the aggregated gradient g.
Communication time: md/b, since the full gradient of d parameters is passed around the ring about m times and each transfer takes d/b. (Ignore latency.)
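A back-of-the-envelope instance of md/b, with all numbers assumed purely for illustration:

```python
m = 8                     # GPUs
d = 100e6 * 4             # 100M float32 parameters = 400 MB of gradient (bytes)
b = 10e9 / 8              # 10 Gbit/s network = 1.25 GB/s

t = m * d / b             # md/b: the whole gradient circles the ring ~m times
print(f"communication time per iteration: {t:.2f} s")   # ~2.56 s
```

Note that this naive whole-gradient scheme scales linearly with m, whereas the blocked scatter-reduce/all-gather scheme above keeps all links busy and avoids that growth.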