Transformer
The Transformer is a Seq2Seq model.
The Transformer is not an RNN.
It is built purely from attention and dense layers.
It achieves higher accuracy than RNNs on large datasets.
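A minimal numpy sketch of the scaled dot-product self-attention the Transformer is built on (the weight matrices, sizes, and function names below are illustrative assumptions, not values from these notes):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # queries, keys, values from the same input
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product scores
    weights = softmax(scores, axis=-1)        # attention weights, each row sums to 1
    return weights @ V                        # each output is a weighted sum of values

# toy usage: sequence of 4 tokens, model dimension 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (4, 8)
```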
From Single-Head to Multi-Head
Single-head self-attention layers can be combined to form a multi-head self-attention layer.
In the same way, single-head attention layers can be combined to form a multi-head attention layer.
The multi-head output is obtained by concatenating the outputs of the single-head self-attentions.
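A sketch of forming a multi-head output by concatenating single-head outputs (the head count, head size, and the final projection Wo are illustrative assumptions; Wo is the usual output projection, not something stated above):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def multi_head(X, heads, Wo):
    """Run each single head independently, concatenate along the feature axis,
    then mix with the output projection Wo."""
    outs = [single_head(X, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
    return np.concatenate(outs, axis=-1) @ Wo

# toy usage: 4 tokens, d_model = 8, 2 heads of size 4 each
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(2)]
Wo = rng.normal(size=(8, 8))                  # 2 heads * 4 dims -> back to 8
print(multi_head(X, heads, Wo).shape)         # (4, 8)
```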
Self-Attention Layer + Dense Layer
Stacked Self-Attention Layers
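A sketch of one such "multi-head self-attention layer + dense layer" combination in PyTorch (the layer sizes and the use of nn.MultiheadAttention and nn.Linear are illustrative assumptions); stacking then just means applying several of these blocks in sequence, which is exactly what the encoder below does:

```python
import torch
import torch.nn as nn

class AttentionDenseBlock(nn.Module):
    """Multi-head self-attention followed by a position-wise dense layer."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dense = nn.Linear(d_model, d_model)   # applied to every position

    def forward(self, x):
        # self-attention: queries, keys, and values all come from x
        attn_out, _ = self.attn(x, x, x)
        return torch.relu(self.dense(attn_out))

x = torch.randn(2, 10, 512)            # (batch, seq_len, d_model)
print(AttentionDenseBlock()(x).shape)  # torch.Size([2, 10, 512])
```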
Transformer's Encoder
The Transformer is a Seq2Seq model (encoder + decoder).
The Transformer’s encoder contains 6 stacked blocks.
1 block ≈ 1 multi-head self-attention layer + 1 dense layer.
The encoder network is a stack of 6 such blocks.
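A sketch of the 6-block encoder using PyTorch built-ins (nn.TransformerEncoderLayer also adds residual connections and layer normalization beyond the "self-attention + dense" approximation above, and the sizes are illustrative):

```python
import torch
import torch.nn as nn

d_model, n_heads, n_blocks = 512, 8, 6

# one block ~ multi-head self-attention layer + position-wise dense (feed-forward) layer
block = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                   dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(block, num_layers=n_blocks)  # stack of 6 blocks

x = torch.randn(2, 10, d_model)   # (batch, source seq_len, d_model)
memory = encoder(x)               # encoder outputs, same shape as the input
print(memory.shape)               # torch.Size([2, 10, 512])
```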
Transformer’s Decoder: One Block
1 block ≈ 1 multi-head self-attention layer + 1 multi-head attention layer (over the encoder outputs) + 1 dense layer.
The decoder network is a stack of 6 such blocks.
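A matching sketch of the decoder stack, where nn.TransformerDecoderLayer stands in for one block (masked self-attention + encoder-decoder attention + dense layer, plus residuals and layer norm); the sizes and toy tensors are illustrative assumptions:

```python
import torch
import torch.nn as nn

d_model, n_heads, n_blocks = 512, 8, 6

block = nn.TransformerDecoderLayer(d_model=d_model, nhead=n_heads,
                                   dim_feedforward=2048, batch_first=True)
decoder = nn.TransformerDecoder(block, num_layers=n_blocks)   # stack of 6 blocks

tgt = torch.randn(2, 7, d_model)      # target-side inputs (batch, tgt_len, d_model)
memory = torch.randn(2, 10, d_model)  # encoder outputs that the decoder attends to
out = decoder(tgt, memory)
print(out.shape)                      # torch.Size([2, 7, 512])
```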
Put things together: Transformer
The Transformer is a Seq2Seq model; it has an encoder and a decoder.
The Transformer model is not an RNN.
It is based on attention and self-attention.
The Transformer outperforms the state-of-the-art RNN models.
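A sketch that wires the two stacks together with nn.Transformer, whose defaults (6 encoder blocks, 6 decoder blocks, 8 heads, d_model = 512) match the description above; the toy input tensors are assumptions:

```python
import torch
import torch.nn as nn

# encoder + decoder, 6 blocks each, multi-head attention with 8 heads
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(2, 10, 512)   # source sequence embeddings
tgt = torch.randn(2, 7, 512)    # target sequence embeddings (shifted right)
out = model(src, tgt)           # decoder output for every target position
print(out.shape)                # torch.Size([2, 7, 512])
```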