Attention without RNN

Attention Layer: Attention without RNN

Compared with Simple RNN + Attention, the attention layer here does not use the RNN's hidden states to compute the keys, values, and queries; they are computed directly from the encoder's and decoder's inputs.

  • We study the Seq2Seq model (encoder + decoder).

  • Encoder's inputs are vectors $x_1, x_2, \dots, x_m$.

  • Decoder's inputs are vectors $x'_1, x'_2, \dots, x'_t$.

  • Keys and values are based on the encoder's inputs $x_1, x_2, \dots, x_m$:

    • Key: $k_{:i} = W_K x_i$

    • Value: $v_{:i} = W_V x_i$

  • Queries are based on the decoder's inputs $x'_1, x'_2, \dots, x'_t$:

    • Query: $q_{:j} = W_Q x'_j$

  • Stack the columns into matrices $K = [k_{:1}, \dots, k_{:m}]$, $V = [v_{:1}, \dots, v_{:m}]$, and $Q = [q_{:1}, \dots, q_{:t}]$.

  • Compute the attention weights: $\alpha_{:j} = \mathrm{Softmax}(K^T q_{:j}) \in \mathbb{R}^m$.

  • Context vector: $c_{:j} = \alpha_{1j} v_{:1} + \dots + \alpha_{mj} v_{:m} = V \alpha_{:j}$. Thus $c_{:j}$ is a function of $x'_j$ and $[x_1, \dots, x_m]$.

  • Output of the attention layer: $C = [c_{:1}, c_{:2}, c_{:3}, \dots, c_{:t}]$ (a minimal code sketch follows this list).
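
A minimal NumPy sketch of the computation above (the toy shapes and variable names are my own, not from the original notes): keys and values are built from the encoder inputs, queries from the decoder inputs, and each context vector is a weighted combination of the value vectors.

```python
import numpy as np


def softmax_columns(scores):
    """Column-wise softmax: each column of `scores` becomes a probability vector."""
    shifted = scores - scores.max(axis=0, keepdims=True)  # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def attn(X, X_prime, W_Q, W_K, W_V):
    """Attention layer C = Attn(X, X').

    Columns of X are the encoder inputs x_1..x_m; columns of X_prime are the
    decoder inputs x'_1..x'_t. Keys/values come from X, queries from X_prime.
    """
    K = W_K @ X        # keys:    k_{:i} = W_K x_i   -> shape (d, m)
    V = W_V @ X        # values:  v_{:i} = W_V x_i   -> shape (d, m)
    Q = W_Q @ X_prime  # queries: q_{:j} = W_Q x'_j  -> shape (d, t)
    A = softmax_columns(K.T @ Q)  # weights: alpha_{:j} = Softmax(K^T q_{:j}) -> (m, t)
    return V @ A       # context vectors: c_{:j} = V alpha_{:j} -> shape (d, t)


# Toy usage: random inputs and parameters, just to check the shapes.
rng = np.random.default_rng(0)
d_in, d, m, t = 6, 5, 4, 3
X = rng.normal(size=(d_in, m))        # encoder inputs, one column per x_i
X_prime = rng.normal(size=(d_in, t))  # decoder inputs, one column per x'_j
W_Q = rng.normal(size=(d, d_in))
W_K = rng.normal(size=(d, d_in))
W_V = rng.normal(size=(d, d_in))
C = attn(X, X_prime, W_Q, W_K, W_V)
print(C.shape)  # (5, 3): one context vector c_{:j} per decoder input
```

Note that this sketch uses the plain dot product $K^T q_{:j}$ without the $1/\sqrt{d}$ scaling used in the Transformer paper, matching the formulas above.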

Attention layer summary

  • Attention layer: $C = \mathrm{Attn}(X, X')$.

  • Encoder's inputs: $X = [x_1, x_2, \dots, x_m]$.

  • Decoder's inputs: $X' = [x'_1, x'_2, \dots, x'_t]$.

  • Parameters: $W_Q, W_K, W_V$.

Self-Attention layer summary

  • Self-Attention layer: $C = \mathrm{Attn}(X, X)$.

  • Inputs: $X = [x_1, x_2, \dots, x_m]$.

  • Parameters: $W_Q, W_K, W_V$.

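Continuing the sketch above (same toy `attn` function and parameter matrices), self-attention is just the attention layer applied with the decoder inputs replaced by the encoder inputs, i.e. with $X' = X$:

```python
def self_attn(X, W_Q, W_K, W_V):
    # Self-Attention layer: C = Attn(X, X) -- queries, keys, and values
    # are all computed from the same input sequence X.
    return attn(X, X, W_Q, W_K, W_V)


C_self = self_attn(X, W_Q, W_K, W_V)
print(C_self.shape)  # (5, 4): one context vector c_{:i} per input x_i
```
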
Summary

  • Attention was originally developed for Seq2Seq RNN models [1].

  • Self-attention: attention for all RNN models (not necessarily Seq2Seq models) [2].

  • Attention can be used without RNN [3].

  • We learned how to build an attention layer and a self-attention layer.

Reference

  • Original paper: Vaswani et al. Attention Is All You Need. In NIPS, 2017.
