Transformer
- Transformer is a Seq2Seq model. 
- Transformer is not an RNN. 
- It is based purely on attention and dense (fully connected) layers (see the sketch below). 
- Higher accuracy than RNNs on large datasets. 
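To make "purely attention and dense layers" concrete, here is a minimal NumPy sketch of a single-head self-attention layer (scaled dot-product attention with learned matrices W_Q, W_K, W_V). The column-per-token layout and the dimension names d and m are assumptions for illustration, not the exact implementation from the paper.

```python
import numpy as np

def softmax(scores, axis=0):
    """Column-wise softmax (numerically stabilized)."""
    scores = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention.
    X: input matrix of shape (d_in, m), one column per input vector x_1..x_m.
    W_Q, W_K, W_V: the 3 parameter matrices, each of shape (d, d_in).
    Returns the context matrix C of shape (d, m), one column per position."""
    Q = W_Q @ X                                  # queries, shape (d, m)
    K = W_K @ X                                  # keys,    shape (d, m)
    V = W_V @ X                                  # values,  shape (d, m)
    d = Q.shape[0]
    A = softmax(K.T @ Q / np.sqrt(d), axis=0)    # attention weights, shape (m, m); each column sums to 1
    return V @ A                                 # each output column is a weighted average of the value vectors
```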
From Single-Head to Multi-Head
- Single-head self-attention layers can be combined into a multi-head self-attention layer. 
- Likewise, single-head attention layers can be combined into a multi-head attention layer. 
- A multi-head self-attention uses l single-head self-attentions, which do not share parameters. 
- A single-head self-attention has 3 parameter matrices: W_Q, W_K, W_V. 
- In total, the multi-head layer therefore has 3l parameter matrices. 
- The multi-head output is obtained by concatenating the outputs of the l single-head self-attentions (see the sketch below). 
- If each single-head output is a d × m matrix, the multi-head output has shape (l·d) × m. 
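Below is a hedged sketch of the single-head-to-multi-head construction, reusing `single_head_self_attention` from above: the l heads keep separate parameters (3l matrices in total) and their outputs are concatenated along the feature dimension. The head count and sizes are illustrative assumptions.

```python
def multi_head_self_attention(X, heads):
    """heads: list of (W_Q, W_K, W_V) triples, one triple per single head.
    Heads do not share parameters, so l heads carry 3*l parameter matrices.
    Each head's output has shape (d, m); concatenation gives shape (l*d, m)."""
    outputs = [single_head_self_attention(X, W_Q, W_K, W_V)
               for (W_Q, W_K, W_V) in heads]
    return np.concatenate(outputs, axis=0)

# Illustrative sizes (assumed): l = 8 heads, d_in = 512, d = 64, sequence length m = 10.
rng = np.random.default_rng(0)
d_in, d, m, l = 512, 64, 10, 8
heads = [tuple(rng.standard_normal((d, d_in)) for _ in range(3)) for _ in range(l)]
X = rng.standard_normal((d_in, m))
C = multi_head_self_attention(X, heads)
print(C.shape)  # (512, 10): l*d = 8*64 rows, m = 10 columns
```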

Self-Attention Layer + Dense Layer

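This building block, used inside each encoder block below, can be sketched as follows, reusing `multi_head_self_attention` from above. The dense layer is applied position-wise, i.e. the same weights act on every column of the attention output; the residual connections and layer normalization of the full Transformer are omitted to keep the sketch minimal.

```python
def attention_plus_dense(X, heads, W_dense, b_dense):
    """One multi-head self-attention layer followed by a position-wise dense layer.
    X: (d_in, m); multi-head output C: (l*d, m); W_dense: (d_out, l*d); b_dense: (d_out, 1).
    The dense layer maps each column independently: u_j = ReLU(W_dense @ c_j + b_dense)."""
    C = multi_head_self_attention(X, heads)        # (l*d, m)
    U = np.maximum(0.0, W_dense @ C + b_dense)     # (d_out, m), ReLU applied elementwise
    return U
```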
Stacked Self-Attention Layers

Transformer's Encoder

- Transformer is a Seq2Seq model (encoder + decoder). 
- Encoder’s inputs are vectors x_1, …, x_m (one per position in the source sequence). 
- Decoder’s inputs are vectors x'_1, …, x'_t (one per position in the target sequence). 
- Transformer’s encoder contains 6 stacked blocks. 
- 1 block ≈ 1 multi-head self-attention layer + 1 dense layer. 
- Input shape: 512 × m (512 is the embedding dimension of the original Transformer; m is the sequence length). 
- Output shape: 512 × m, the same as the input, so blocks can be stacked. 
- The encoder network is a stack of 6 such blocks (a minimal sketch follows). 
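A minimal sketch of the encoder stack, assuming the `attention_plus_dense` helper above and the shapes stated in this list (d_model = 512, 8 heads of size 64, so each block maps 512 × m to 512 × m and the blocks compose cleanly). The random initialization is only there to make the sketch runnable.

```python
def make_encoder_blocks(n_blocks=6, d_model=512, n_heads=8, d_head=64, seed=0):
    """Build parameters for n_blocks encoder blocks (random weights, for illustration only)."""
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(n_blocks):
        heads = [tuple(rng.standard_normal((d_head, d_model)) for _ in range(3))
                 for _ in range(n_heads)]
        W_dense = rng.standard_normal((d_model, n_heads * d_head))  # maps 512 -> 512
        b_dense = np.zeros((d_model, 1))
        blocks.append((heads, W_dense, b_dense))
    return blocks

def transformer_encoder(X, blocks):
    """Encoder = stack of blocks; each block maps (512, m) to (512, m)."""
    H = X
    for (heads, W_dense, b_dense) in blocks:
        H = attention_plus_dense(H, heads, W_dense, b_dense)
    return H
```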
Transformer’s Decoder: One Block


- 1 decoder block ≈ 1 multi-head self-attention layer + 1 multi-head attention layer + 1 dense layer. 
- Input shapes: 512 × m (the encoder’s output) and 512 × t (the decoder-side sequence). 
- Output shape: 512 × t. 
- The decoder network is a stack of 6 such blocks (a minimal sketch of one block follows). 
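A hedged sketch of one decoder block, reusing the helpers above. The self-attention here is simplified (the causal mask used during training is omitted), and the multi-head (cross-)attention takes its queries from the decoder stream and its keys/values from the encoder output, which is why the block's output keeps shape 512 × t.

```python
def single_head_attention(X_dec, X_enc, W_Q, W_K, W_V):
    """Single-head (cross-)attention: queries from the decoder side, keys/values from the encoder side.
    X_dec: (512, t), X_enc: (512, m); output: (d, t)."""
    Q = W_Q @ X_dec                               # (d, t)
    K = W_K @ X_enc                               # (d, m)
    V = W_V @ X_enc                               # (d, m)
    d = Q.shape[0]
    A = softmax(K.T @ Q / np.sqrt(d), axis=0)     # (m, t); each decoder position attends over m encoder positions
    return V @ A                                  # (d, t)

def decoder_block(X_dec, enc_out, self_heads, cross_heads, W_dense, b_dense):
    """One decoder block: multi-head self-attention + multi-head (cross-)attention + dense.
    X_dec: (512, t), enc_out: (512, m); output: (512, t)."""
    C = multi_head_self_attention(X_dec, self_heads)                       # (512, t)
    Z = np.concatenate([single_head_attention(C, enc_out, *h) for h in cross_heads],
                       axis=0)                                             # (512, t)
    return np.maximum(0.0, W_dense @ Z + b_dense)                          # (512, t)
```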
Put things together: Transformer

- Transformer is Seq2Seq model; it has an encoder and a decoder. 
- The Transformer model is not an RNN. 
- Transformer is based on attention and self-attention. 
- The Transformer outperforms the previous state-of-the-art RNN models on machine translation benchmarks (a toy end-to-end sketch follows). 
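As a toy end-to-end sketch (assumed shapes: 512 × m for the encoder input, 512 × t for the decoder input), the pieces above can be wired together as follows. Embedding layers, positional encoding, residual connections, layer normalization, and the output softmax of the real model are all omitted.

```python
def transformer_forward(X_enc, X_dec, encoder_blocks, decoder_blocks):
    """Toy Transformer forward pass: run the encoder, then run the decoder stack,
    with every decoder block attending to the final encoder output."""
    enc_out = transformer_encoder(X_enc, encoder_blocks)   # (512, m)
    H = X_dec                                              # (512, t)
    for (self_heads, cross_heads, W_dense, b_dense) in decoder_blocks:
        H = decoder_block(H, enc_out, self_heads, cross_heads, W_dense, b_dense)
    return H                                               # (512, t)
```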