Transformer

  • Transformer is a Seq2Seq model.

  • Transformer is not an RNN.

  • It is built purely from attention layers and dense layers.

  • It achieves higher accuracy than RNNs on large datasets (see the sketch after this list).
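A minimal sketch of these points (assuming PyTorch; the shapes and hyper-parameters below are illustrative, not taken from the lecture): `torch.nn.Transformer` maps a source sequence and a target sequence to an output sequence using only attention and dense layers, with no recurrence.

```python
import torch
import torch.nn as nn

# Standard Transformer: 6 encoder blocks + 6 decoder blocks, 8 attention heads.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(2, 10, 512)   # (batch, source length, embedding dim)
tgt = torch.randn(2, 7, 512)    # (batch, target length, embedding dim)
out = model(src, tgt)           # (2, 7, 512): one output vector per target position
```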

From Single-Head to Multi-Head

  • Single-head self-attention layers can be combined to form a multi-head self-attention layer.

  • In the same way, single-head attention layers can be combined to form a multi-head attention layer.

  • A multi-head layer concatenates the outputs of its single-head attentions (see the sketch after this list).
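The sketch below shows the combination directly: several single heads run in parallel on the same input, and their outputs are concatenated along the feature dimension. This is a hypothetical implementation assuming PyTorch; the class names and sizes are my own, not from the lecture.

```python
import torch
import torch.nn as nn

class SingleHeadSelfAttention(nn.Module):
    """One attention head: queries, keys, and values all come from the same input."""
    def __init__(self, d_model, d_head):
        super().__init__()
        self.wq = nn.Linear(d_model, d_head, bias=False)
        self.wk = nn.Linear(d_model, d_head, bias=False)
        self.wv = nn.Linear(d_model, d_head, bias=False)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        q, k, v = self.wq(x), self.wk(x), self.wv(x)
        scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)
        return torch.softmax(scores, dim=-1) @ v            # (batch, seq, d_head)

class MultiHeadSelfAttention(nn.Module):
    """Run several single heads in parallel and concatenate their outputs."""
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        d_head = d_model // num_heads
        self.heads = nn.ModuleList(
            [SingleHeadSelfAttention(d_model, d_head) for _ in range(num_heads)])
        self.wo = nn.Linear(d_model, d_model)                # mixes the concatenated heads

    def forward(self, x):
        concat = torch.cat([h(x) for h in self.heads], dim=-1)  # (batch, seq, d_model)
        return self.wo(concat)
```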

Self-Attention Layer + Dense Layer

Stacked Self-Attention Layers
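A sketch of both ideas, assuming PyTorch and reusing the `MultiHeadSelfAttention` class from the previous sketch (residual connections and layer normalization, which real Transformer blocks also use, are omitted for brevity): one layer applies multi-head self-attention followed by a position-wise dense layer, and several such layers are stacked so the output of one feeds the next.

```python
class SelfAttentionDenseBlock(nn.Module):
    """One layer: multi-head self-attention followed by a position-wise dense layer."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, num_heads)   # from the sketch above
        self.dense = nn.Sequential(nn.Linear(d_model, d_ff),
                                   nn.ReLU(),
                                   nn.Linear(d_ff, d_model))

    def forward(self, x):                 # x: (batch, seq, d_model)
        x = self.attn(x)                  # every position attends to all positions
        return self.dense(x)              # dense layer applied to each position separately

# Stacked self-attention layers: feed the output of one block into the next.
stacked = nn.Sequential(*[SelfAttentionDenseBlock() for _ in range(3)])
```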

Transformer's Encoder

  • Transformer is a Seq2Seq model (encoder + decoder).

  • Transformer’s encoder contains 6 stacked blocks.

  • 1 block ≈ 1 multi-head self-attention layer + 1 dense layer.

  • The encoder network is a stack of 6 such blocks (see the sketch after this list).
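A minimal sketch of the encoder (assuming PyTorch; `nn.TransformerEncoderLayer` bundles one multi-head self-attention layer with one dense feed-forward sub-layer, plus the residual connections and layer normalization not emphasized above):

```python
import torch
import torch.nn as nn

block = nn.TransformerEncoderLayer(d_model=512, nhead=8,
                                   dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(block, num_layers=6)   # stack of 6 such blocks

src = torch.randn(2, 10, 512)      # (batch, source length, embedding dim)
memory = encoder(src)              # (2, 10, 512): one output vector per input position
```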

Transformer’s Decoder: One Block

  • 1 block ≈ 1 masked multi-head self-attention layer + 1 multi-head attention layer (attending to the encoder outputs) + 1 dense layer.

  • The decoder network is a stack of 6 such blocks (see the sketch after this list).
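A matching sketch for the decoder (assuming PyTorch; `nn.TransformerDecoderLayer` contains masked multi-head self-attention, multi-head attention over the encoder outputs, and a dense feed-forward sub-layer). The causal mask keeps each target position from attending to later positions.

```python
import torch
import torch.nn as nn

block = nn.TransformerDecoderLayer(d_model=512, nhead=8,
                                   dim_feedforward=2048, batch_first=True)
decoder = nn.TransformerDecoder(block, num_layers=6)   # stack of 6 such blocks

memory = torch.randn(2, 10, 512)   # encoder outputs (see the encoder sketch above)
tgt = torch.randn(2, 7, 512)       # target-side embeddings
causal_mask = torch.triu(torch.full((7, 7), float('-inf')), diagonal=1)
out = decoder(tgt, memory, tgt_mask=causal_mask)   # (2, 7, 512)
```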

Put things together: Transformer

  • Transformer is a Seq2Seq model; it has an encoder and a decoder.

  • The Transformer model is not an RNN.

  • Transformer is based on attention and self-attention.

  • Transformer outperforms state-of-the-art RNN models.

