Transformer


  • Transformer is a Seq2Seq model.

  • Transformer is not an RNN.

  • It is built purely from attention and dense layers.

  • It achieves higher accuracy than RNNs on large datasets.

From Single-Head to Multi-Head

  • Several single-head self-attentions can be combined to form a multi-head self-attention.

  • Likewise, several single-head attentions can be combined to form a multi-head attention.

  • Use $l$ single-head self-attentions (which do not share parameters); a minimal sketch follows this list.

    • A single-head self-attention has 3 parameter matrices: $W_Q, W_K, W_V$.

    • In total there are $3l$ parameter matrices.

  • Concatenate the outputs of the single-head self-attentions.

    • Suppose each single-head self-attention's output is a $d \times m$ matrix.

      • The multi-head output shape is then $(ld) \times m$.
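The construction above can be written out directly. Below is a minimal sketch in PyTorch, assuming column-vector inputs of shape $d_{in} \times m$; the function names and the toy dimensions (8 heads of size 64) are illustrative assumptions, not part of the notes.

```python
import torch

def single_head_self_attention(X, W_Q, W_K, W_V):
    """X: (d_in, m) inputs as columns; W_Q, W_K, W_V: (d, d_in). Returns (d, m)."""
    Q, K, V = W_Q @ X, W_K @ X, W_V @ X        # queries, keys, values: each (d, m)
    scores = Q.T @ K / (Q.shape[0] ** 0.5)     # (m, m) scaled dot products q_i . k_j
    weights = torch.softmax(scores, dim=-1)    # each query attends over all m positions
    return V @ weights.T                       # (d, m): weighted sums of the value vectors

def multi_head_self_attention(X, heads):
    """heads: list of l (W_Q, W_K, W_V) tuples; the heads share no parameters."""
    outputs = [single_head_self_attention(X, *h) for h in heads]  # l outputs, each (d, m)
    return torch.cat(outputs, dim=0)           # concatenated output: (l*d, m)

# Toy usage: l = 8 heads, input dim 512, head dim d = 64, sequence length m = 10.
l, d_in, d, m = 8, 512, 64, 10
heads = [tuple(torch.randn(d, d_in) for _ in range(3)) for _ in range(l)]
X = torch.randn(d_in, m)
print(multi_head_self_attention(X, heads).shape)   # torch.Size([512, 10])
```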

Self-Attention Layer + Dense Layer

  • A (multi-head) self-attention layer is followed by a dense layer applied to each output vector, with the same dense-layer parameters used at every position; together they form one building block.

Stacked Self-Attention Layers

  • Such self-attention + dense blocks can be stacked: the outputs of one block become the inputs of the next.

Transformer's Encoder

  • Transformer is a Seq2Seq model (encoder + decoder).

  • Transformer's encoder contains 6 stacked blocks.

  • 1 block ≈ 1 multi-head self-attention layer + 1 dense layer.

  • The encoder's inputs are vectors $x_1, x_2, \ldots, x_m$.

    • Input shape: $512 \times m$

    • Output shape: $512 \times m$

  • The encoder network is a stack of 6 such blocks; a sketch is given below.
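A minimal sketch of the encoder as summarized above: one block ≈ multi-head self-attention + a dense layer, and the encoder stacks 6 such blocks on 512-dimensional vectors. It uses PyTorch's nn.MultiheadAttention for the attention layer; the residual connections and layer normalization of the original Transformer are omitted to stay close to the "attention + dense" summary.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder block ≈ multi-head self-attention + position-wise dense layer."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads)
        self.dense = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (m, batch, 512)
        attn_out, _ = self.self_attn(x, x, x)    # queries = keys = values = x
        return torch.relu(self.dense(attn_out))  # same shape as the input: (m, batch, 512)

class Encoder(nn.Module):
    """The encoder network: a stack of 6 such blocks."""
    def __init__(self, num_blocks=6, dim=512, num_heads=8):
        super().__init__()
        self.blocks = nn.ModuleList([EncoderBlock(dim, num_heads) for _ in range(num_blocks)])

    def forward(self, x):                        # input shape: 512 x m (per sequence)
        for block in self.blocks:
            x = block(x)
        return x                                 # output shape: 512 x m (per sequence)

# Toy usage: m = 10 input vectors x_1..x_m, batch size 2.
out = Encoder()(torch.randn(10, 2, 512))
print(out.shape)                                 # torch.Size([10, 2, 512])
```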

Transformer's Decoder: One Block

  • The decoder's inputs are vectors $x'_1, x'_2, \ldots, x'_t$, together with the encoder's outputs.

  • 1 decoder block ≈ multi-head self-attention + multi-head attention + dense.

    • Input shape: $(512 \times m, 512 \times t)$

    • Output shape: $512 \times t$

  • The decoder network is a stack of 6 such blocks; a sketch is given below.
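A similar sketch for one decoder block, under the same simplifications as the encoder sketch above: multi-head self-attention over the decoder inputs, multi-head (cross-)attention whose keys and values come from the encoder outputs, then a dense layer. The causal mask, residual connections, and layer normalization of the original paper are again left out.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block ≈ multi-head self-attention + multi-head attention + dense."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads)   # over decoder inputs x'_1..x'_t
        self.cross_attn = nn.MultiheadAttention(dim, num_heads)  # over the encoder outputs
        self.dense = nn.Linear(dim, dim)

    def forward(self, x, enc_out):        # x: (t, batch, 512), enc_out: (m, batch, 512)
        h, _ = self.self_attn(x, x, x)                  # self-attention on the decoder side
        h, _ = self.cross_attn(h, enc_out, enc_out)     # queries from decoder, keys/values from encoder
        return torch.relu(self.dense(h))                # output shape: (t, batch, 512), i.e. 512 x t

# Toy usage: encoder output of length m = 10, decoder input of length t = 7.
block = DecoderBlock()
out = block(torch.randn(7, 1, 512), torch.randn(10, 1, 512))
print(out.shape)                          # torch.Size([7, 1, 512])
```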

Put things together: Transformer

  • Transformer is Seq2Seq model; it has an encoder and a decoder.

  • Transformer is not an RNN.

  • Transformer is based on attention and self-attention.

  • Transformer outperforms all the state-of-the-art RNN models; a sketch that assembles the full model follows below.
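To put the pieces together, the sketch below uses PyTorch's built-in nn.Transformer, which follows the structure described above (6 encoder blocks and 6 decoder blocks on 512-dimensional vectors) and additionally includes the residual connections, layer normalization, and feed-forward details of the original paper; the toy sequence lengths are illustrative.

```python
import torch
import torch.nn as nn

# 6 encoder blocks + 6 decoder blocks, 512-dimensional representations, 8 heads.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.randn(10, 2, 512)   # encoder inputs x_1..x_m   (m = 10, batch size 2)
tgt = torch.randn(7, 2, 512)    # decoder inputs x'_1..x'_t (t = 7,  batch size 2)

out = model(src, tgt)           # the encoder output (512 x m) feeds every decoder block
print(out.shape)                # torch.Size([7, 2, 512]): one 512-dim vector per target position
```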

Reference


youtube