Decentralized And Ring All Reduce

Parallel Gradient Descent in Decentralized Network

Decentralized Network

Characteristics: peer-to-peer architecture (no central server); communication by message passing; each node communicates only with its neighbors.

Decentralized Gradient Descent
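
The original page illustrates this section with a figure only. As a rough illustration of the algorithm it depicts, here is a minimal NumPy sketch of decentralized gradient descent on a toy least-squares problem; the ring topology, uniform mixing weights, step size and data are my own assumptions, not taken from the page.

```python
import numpy as np

# Decentralized gradient descent on a ring of m nodes.
# Node i holds local data (A_i, b_i) and its own parameter copy x_i.
# Each iteration: average the copy with the two ring neighbors (message
# passing), then take a gradient step on the local loss
# f_i(x) = 0.5 * ||A_i x - b_i||^2.
m, d, lr = 8, 5, 0.01
rng = np.random.default_rng(0)
A = [rng.normal(size=(20, d)) for _ in range(m)]
b = [rng.normal(size=20) for _ in range(m)]
x = [np.zeros(d) for _ in range(m)]

for step in range(500):
    new_x = []
    for i in range(m):
        mixed = (x[(i - 1) % m] + x[i] + x[(i + 1) % m]) / 3.0  # neighbor averaging
        grad = A[i].T @ (A[i] @ x[i] - b[i])                     # local gradient
        new_x.append(mixed - lr * grad)
    x = new_x  # all nodes update in parallel

# After enough iterations the copies x_i approximately agree with one another
# and lie near a minimizer of the global loss sum_i f_i(x).
```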

Theories of Decentralized Algorithms

  • Decentralized GD and SGD are guaranteed to converge.

  • Convergence rate depends on how well the nodes are connected (one way to quantify this is sketched after this list).

    • If the nodes are well connected, then convergence is fast.

    • If the graph is not strongly connected, then it does not converge.
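
One standard way to make "how well connected" precise (this is the usual decentralized-SGD analysis, not something stated on this page): write the update with a doubly stochastic mixing matrix $W$ supported on the communication graph,

$$x_i^{t+1} = \sum_{j} W_{ij}\, x_j^{t} \;-\; \eta\, \nabla f_i\!\left(x_i^{t}\right),$$

and measure connectivity by the spectral gap $1 - \sigma_2(W)$: a larger gap gives faster consensus among the nodes, while a disconnected graph has $\sigma_2(W) = 1$ and the copies never reach consensus.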

Ring All-reduce

Reduce vs. All-Reduce

Reduce

Only the server gets the result of the reduce operation (e.g., sum, mean, or count).

All-Reduce

  • Every node gets a copy of the result of the reduce operation.

    • E.g., all-reduce via reduce + broadcast (see the sketch after this list).

    • E.g., all-reduce via all-to-all communication.

    • E.g., ring all-reduce.
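
As a concrete illustration of reduce vs. all-reduce (and of building all-reduce from reduce + broadcast), here is a minimal mpi4py sketch; mpi4py and an MPI implementation are assumed to be installed, and the filename in the launch command is just a placeholder (e.g., `mpirun -np 4 python reduce_vs_allreduce.py`).

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each worker holds a local "gradient".
local_grad = np.full(4, float(rank))

# Reduce: only the root (rank 0) receives the elementwise sum; on all other
# ranks the receive buffer is left untouched.
total = np.zeros(4)
comm.Reduce(local_grad, total, op=MPI.SUM, root=0)

# All-reduce: every rank receives the elementwise sum.
total_all = np.zeros(4)
comm.Allreduce(local_grad, total_all, op=MPI.SUM)

# All-reduce built from reduce + broadcast (lowercase methods work on
# generic Python objects).
partial = comm.reduce(rank, op=MPI.SUM, root=0)   # only rank 0 gets the sum
summed = comm.bcast(partial, root=0)              # now every rank has it

print(f"rank {rank}: allreduce -> {total_all}, reduce+bcast -> {summed}")
```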

Ring All-Reduce

Naïve Approach

Pass the gradients around the ring one link at a time: GPU 0 sends $g_0$ to GPU 1, GPU 1 adds its own gradient and sends $g_0 + g_1$ to GPU 2, and so on, until the total $g = g_0 + g_1 + \cdots + g_{m-1}$ is computed. Note: $g_0 + g_1$ is considered as a single gradient and is transferred only once; $g_0$ and $g_1$ are not transferred separately. Then continue around the ring to update all GPUs with the gradient $g$.

What is wrong with the naïve approach?

  • Only one link is busy at any time, so most of the network is idle.

  • Communication time: $\frac{md}{b}$. (Ignoring latency.)

    • m: number of GPUs.

    • d: number of parameters.

    • b: network bandwidth.

Effective Approach

Split each GPU's gradient into m blocks. First (scatter-reduce), in every step each GPU sends one block to its next neighbor on the ring and adds the block it receives to its own copy, so after m − 1 steps each GPU holds the fully reduced sum of one block (the red blocks in the lecture figures). Then continue around the ring to update all GPUs' gradients with these red blocks (all-gather), so that after another m − 1 steps every GPU holds the complete summed gradient.
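
The figures of the individual steps are omitted here; as a stand-in, below is a toy NumPy simulation of the two phases (scatter-reduce, then all-gather). The block-indexing convention is my own choice for illustration, not necessarily the one used in the figures.

```python
import numpy as np

# Toy simulation of ring all-reduce with m GPUs, each holding a gradient of
# size d. The gradient is split into m blocks; phase 1 (scatter-reduce)
# accumulates each block around the ring, phase 2 (all-gather) circulates the
# fully reduced blocks so every GPU ends up with the complete sum.
m, d = 4, 8
rng = np.random.default_rng(0)
grads = [rng.normal(size=d) for _ in range(m)]          # g_0, ..., g_{m-1}
blocks = [np.array_split(g.copy(), m) for g in grads]   # blocks[i][k]: block k on GPU i

# Phase 1: scatter-reduce. In step s, GPU i sends block (i - s) mod m to GPU
# i+1, which adds it to its own copy; all m links are busy at the same time.
for s in range(m - 1):
    sent = [blocks[i][(i - s) % m].copy() for i in range(m)]
    for i in range(m):
        blocks[(i + 1) % m][(i - s) % m] += sent[i]
# Now GPU i holds the fully reduced block (i + 1) mod m.

# Phase 2: all-gather. The reduced blocks are passed around the ring so that
# every GPU obtains every fully reduced block.
for s in range(m - 1):
    sent = [blocks[i][(i + 1 - s) % m].copy() for i in range(m)]
    for i in range(m):
        blocks[(i + 1) % m][(i + 1 - s) % m] = sent[i]

# Every GPU now holds the full sum g_0 + ... + g_{m-1}, block by block.
expected = np.array_split(sum(grads), m)
for i in range(m):
    for k in range(m):
        assert np.allclose(blocks[i][k], expected[k])
```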

Comparison
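
The comparison figure is not reproduced here. As a rough back-of-the-envelope comparison (standard ring all-reduce accounting with the symbols defined above, so treat it as approximate rather than as the figure's exact numbers):

$$T_{\text{naive}} \approx \frac{m\,d}{b}, \qquad T_{\text{ring}} \approx 2\,(m-1)\cdot\frac{d/m}{b} \approx \frac{2d}{b}.$$

Because the ring version keeps every link busy, its communication time is nearly independent of the number of GPUs $m$.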

Horovod

  • In addition to being network-optimal, the allreduce approach is much easier to understand and adopt. Users utilize a Message Passing Interface (MPI) implementation such as Open MPI to launch all copies of the TensorFlow program. MPI then transparently sets up the distributed infrastructure necessary for workers to communicate with each other. All the user needs to do is modify their program to average gradients using an allreduce() operation.

  • We replaced the Baidu ring-allreduce implementation with NCCL. NCCL is NVIDIA’s library for collective communication that provides a highly optimized version of ring-allreduce. NCCL 2 introduced the ability to run ring-allreduce across multiple machines, enabling us to take advantage of its many performance boosting optimizations.

  • We added support for models that fit inside a single server, potentially on multiple GPUs, whereas the original version only supported models that fit on a single GPU.

  • Finally, we made several API improvements inspired by feedback we received from a number of initial users. In particular, we implemented a broadcast operation that enforces consistent initialization of the model on all workers. The new API allowed us to cut down the number of operations a user had to introduce to their single GPU program to four.
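
The quoted text refers to Horovod's TensorFlow API. As a rough illustration of what those "four operations" look like in practice, here is a minimal sketch using the PyTorch bindings (horovod.torch), assuming Horovod is installed and the script is launched with something like `horovodrun -np 4 python train.py`; the model, learning rate and filename are placeholders.

```python
import torch
import horovod.torch as hvd

# (1) Initialize Horovod.
hvd.init()

# (2) Pin each process to a single GPU, indexed by its local rank.
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

# Placeholder model and optimizer; scaling the learning rate by the number of
# workers follows the convention used in the Horovod examples.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# (3) Wrap the optimizer: optimizer.step() now averages the gradients across
#     all workers with a ring all-reduce before applying them.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# (4) Broadcast the initial parameters and optimizer state from rank 0 so all
#     workers start from a consistent initialization.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# ...the usual single-process training loop (forward, backward,
# optimizer.step()) follows unchanged...
```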

Benchmark

Since both MPI and NCCL support remote direct memory access (RDMA) capable networking (e.g., via InfiniBand or RDMA over Converged Ethernet), we ran additional sets of benchmarking tests using RDMA network cards to determine if they helped us enhance efficiency compared to TCP networking.

A comparison of the images processed per second of Horovod over plain 25GbE TCP and Horovod with 25GbE RDMA-capable networking when running a distributed training job over different numbers of NVIDIA Pascal GPUs for Inception V3, ResNet-101 and VGG-16.

Reference

  • slides

  • youtube