Transformer
Transformer is a Seq2Seq model.
Transformer is not an RNN.
It is built purely from attention and dense layers.
It achieves higher accuracy than RNNs on large datasets.
Multiple single-head self-attentions can be combined to form a multi-head self-attention.
Likewise, multiple single-head attentions can be combined to form a multi-head attention.
A multi-head self-attention layer uses h single-head self-attentions, which do not share parameters.
A single-head self-attention has 3 parameter matrices: W_Q, W_K, W_V.
In total, the multi-head layer therefore has 3h parameter matrices.
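As a concrete illustration, here is a minimal NumPy sketch of one single-head self-attention. It assumes the m input vectors are stacked as the columns of a d_in×m matrix X; the function name, the matrix names W_Q, W_K, W_V, and the toy sizes are illustrative, and the 1/sqrt(d) scaling follows the original Transformer.

```python
import numpy as np

def softmax(z, axis=0):
    # numerically stable softmax along the given axis
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_self_attention(X, W_Q, W_K, W_V):
    """X: d_in x m input (one column per token).
    W_Q, W_K, W_V: the 3 parameter matrices of one head (each d x d_in)."""
    Q = W_Q @ X                          # queries, d x m
    K = W_K @ X                          # keys,    d x m
    V = W_V @ X                          # values,  d x m
    d = Q.shape[0]
    scores = K.T @ Q / np.sqrt(d)        # m x m attention scores
    alpha = softmax(scores, axis=0)      # each column sums to 1
    return V @ alpha                     # d x m output

# toy example: m = 4 tokens, input dimension d_in = 8, per-head dimension d = 8
rng = np.random.default_rng(0)
d_in, d, m = 8, 8, 4
X = rng.normal(size=(d_in, m))
W_Q, W_K, W_V = (rng.normal(size=(d, d_in)) for _ in range(3))
out = single_head_self_attention(X, W_Q, W_K, W_V)
print(out.shape)   # (8, 4) -> d x m
```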
The multi-head output is formed by concatenating the outputs of the single-head self-attentions.
Suppose each single-head self-attention's output is a d×m matrix, where m is the sequence length.
Multi-head's output shape: (hd)×m.
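Continuing the sketch above (it reuses single_head_self_attention and the toy variables already defined), concatenating h heads, each with its own non-shared W_Q, W_K, W_V, stacks the d×m head outputs into an (hd)×m matrix:

```python
def multi_head_self_attention(X, heads):
    """heads: list of (W_Q, W_K, W_V) triples, one per head (no parameter sharing).
    Each head returns a d x m matrix; concatenating h of them gives (h*d) x m."""
    outputs = [single_head_self_attention(X, W_Q, W_K, W_V)
               for (W_Q, W_K, W_V) in heads]
    return np.concatenate(outputs, axis=0)

h = 3
heads = [tuple(rng.normal(size=(d, d_in)) for _ in range(3)) for _ in range(h)]
print(multi_head_self_attention(X, heads).shape)   # (24, 4) -> (h*d) x m
```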
Transformer is a Seq2Seq model (encoder + decoder).
Transformer’s encoder contains 6 stacked blocks.
1 encoder block ≈ 1 multi-head self-attention layer + 1 dense layer.
The encoder network is a stack of 6 such blocks.
The decoder network is likewise a stack of 6 blocks (its block structure is described below).
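A rough sketch of one encoder block and the 6-block stack, continuing the NumPy code above. Residual connections and layer normalization, which the real Transformer block also contains, are omitted for brevity, and the names d_model, W_dense, b_dense and the sizes are illustrative.

```python
def dense(W, b, X):
    # position-wise dense layer with ReLU, applied to every column of X
    return np.maximum(W @ X + b, 0.0)

def encoder_block(X, heads, W_dense, b_dense):
    """~ 1 multi-head self-attention layer + 1 dense layer (residuals/LayerNorm omitted)."""
    A = multi_head_self_attention(X, heads)     # (h*d) x m
    return dense(W_dense, b_dense, A)           # back to d_model x m

d_model = d_in                                  # the block must preserve the input shape
blocks = []
for _ in range(6):                              # encoder = stack of 6 blocks
    heads = [tuple(rng.normal(size=(d, d_model)) for _ in range(3)) for _ in range(h)]
    W_dense = rng.normal(size=(d_model, h * d))
    b_dense = np.zeros((d_model, 1))
    blocks.append((heads, W_dense, b_dense))

def encoder(X, blocks):
    for heads, W_dense, b_dense in blocks:
        X = encoder_block(X, heads, W_dense, b_dense)
    return X

print(encoder(X, blocks).shape)    # (8, 4): same d_model x m shape as the input
```

Stacking 6 blocks only works because each block maps a d_model×m input back to a d_model×m output.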
Transformer is a Seq2Seq model; it has an encoder and a decoder.
The Transformer model is not an RNN.
Transformer is based on attention and self-attention.
Transformer outperforms the state-of-the-art RNN models.
Encoder's inputs are vectors x_1, ..., x_m.
Decoder's inputs are vectors x'_1, ..., x'_t.
Encoder's input shape: 512×m (512 is the embedding dimension used in the original Transformer).
Encoder's output shape: 512×m (each encoder block preserves the shape).
1 decoder block ≈ 1 multi-head self-attention layer + 1 multi-head attention layer + 1 dense layer.
Input shapes: 512×m (the encoder's output) and 512×t (the decoder's inputs).
Output shape: 512×t.
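Putting the pieces together, the sketch below (still building on the code above) adds a cross-attention function whose queries come from the decoder side and whose keys/values come from the encoder output, then composes one rough decoder block. Masking, residual connections, and layer normalization are again omitted, and all names and sizes are illustrative.

```python
def single_head_attention(X_enc, X_dec, W_Q, W_K, W_V):
    """Cross attention: queries from the decoder side, keys/values from the encoder side."""
    Q = W_Q @ X_dec                           # d x t
    K = W_K @ X_enc                           # d x m
    V = W_V @ X_enc                           # d x m
    scores = K.T @ Q / np.sqrt(Q.shape[0])    # m x t
    alpha = softmax(scores, axis=0)
    return V @ alpha                          # d x t

def multi_head_attention(X_enc, X_dec, heads):
    return np.concatenate(
        [single_head_attention(X_enc, X_dec, *w) for w in heads], axis=0)

def decoder_block(U_enc, X_dec, self_heads, cross_heads, W_dense, b_dense):
    """~ multi-head self-attention + multi-head attention + dense (mask/residuals/LayerNorm omitted)."""
    S = multi_head_self_attention(X_dec, self_heads)    # (h*d) x t
    # the cross-attention queries here read the raw (h*d)-dim concatenation;
    # the real model first projects it back to d_model
    C = multi_head_attention(U_enc, S, cross_heads)     # (h*d) x t
    return dense(W_dense, b_dense, C)                   # d_model x t

t = 5                                    # decoder-side sequence length
X_dec = rng.normal(size=(d_model, t))    # decoder inputs x'_1 ... x'_t as columns
U = encoder(X, blocks)                   # encoder output, d_model x m

self_heads  = [tuple(rng.normal(size=(d, d_model)) for _ in range(3)) for _ in range(h)]
cross_heads = [(rng.normal(size=(d, h * d)),      # W_Q reads the self-attention output
                rng.normal(size=(d, d_model)),    # W_K reads the encoder output
                rng.normal(size=(d, d_model)))    # W_V reads the encoder output
               for _ in range(h)]
W_dense = rng.normal(size=(d_model, h * d))
b_dense = np.zeros((d_model, 1))

out = decoder_block(U, X_dec, self_heads, cross_heads, W_dense, b_dense)
print(U.shape, out.shape)   # (8, 4) and (8, 5): d_model x m in, d_model x t out
```

The shape check mirrors the note above: the decoder block consumes the encoder output together with the decoder inputs and produces an output whose length matches the decoder side.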