Attention
Shortcoming of Seq2Seq
The encoder compresses the whole source sentence into its final state, which cannot reliably remember a long sequence.
Seq2Seq Model with Attention
Attention tremendously improves the Seq2Seq model.
With attention, the Seq2Seq model does not forget the source input: the decoder can look back at every encoder state.
With attention, the decoder knows where to focus in the source sentence.
Downside: much more computation.
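As a rough sketch of why this works (the symbols h_1, …, h_m for the encoder states, s_j for the current decoder state, and c_j for the context vector are assumptions, not defined in these notes): at every decoder step the source is re-read through a weighted average of all encoder states.

```latex
\begin{aligned}
\tilde{\alpha}_i &= \operatorname{align}(h_i,\ s_j), \qquad i = 1, \dots, m \\
{[\alpha_1, \dots, \alpha_m]} &= \operatorname{Softmax}\!\big([\tilde{\alpha}_1, \dots, \tilde{\alpha}_m]\big) \\
c_j &= \sum_{i=1}^{m} \alpha_i \, h_i
\end{aligned}
```

A large weight α_i means the decoder focuses on source position i at this step, and since c_j is recomputed from all m encoder states at every step, the source no longer has to be squeezed into a single final state.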
Simple RNN + Attention
Option 1 (used in the original paper): score each encoder state with a small alignment network that takes the encoder state and the decoder state as input.
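A sketch of the Option 1 score, in the additive (concatenation) form commonly used to present Bahdanau et al. (2015); v and W are trainable parameters and the bracket denotes concatenation (notation assumed, not from the notes):

```latex
% Additive scoring over encoder states h_i and the current decoder state s_j
\tilde{\alpha}_i = v^{\top} \tanh\!\big(W \, [\, h_i \,;\; s_j \,]\big), \qquad i = 1, \dots, m
```

The scores are then normalized with a softmax, just as in Option 2 below.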
Option 2 (more popular; the same as in the Transformer): map the states to keys and queries with linear maps and score them by inner products, as in the following steps.
Linear maps: key k_i = W_K · h_i for each encoder state h_i (i = 1, …, m), and query q_j = W_Q · s_j for the current decoder state s_j.
Inner product: the un-normalized weight for position i is the inner product k_i^T q_j, for i = 1, …, m.
Normalization: [α_1, …, α_m] = Softmax of the m scores, so the weights are nonnegative and sum to one.
Calculate the next state: form the context vector c_j = α_1·h_1 + … + α_m·h_m, then update the decoder with the next input, the previous state s_j, and c_j to get s_{j+1} (see the sketch below).
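A minimal NumPy sketch of one Option 2 decoder step under the notation above; the parameter names (W_K, W_Q, A, b) and dimensions are illustrative assumptions, not taken from the notes:

```python
import numpy as np

def softmax(x):
    x = x - x.max()               # numerical stability
    e = np.exp(x)
    return e / e.sum()

def attention_step(H, s_prev, x_next, W_K, W_Q, A, b):
    """One decoder step of simple RNN + dot-product attention (Option 2).

    H      : (m, d_h)  all encoder hidden states h_1..h_m
    s_prev : (d_s,)    previous decoder state s_j
    x_next : (d_x,)    embedding of the next decoder input
    """
    K = H @ W_K                    # linear map: keys k_i = W_K h_i, shape (m, d_k)
    q = W_Q @ s_prev               # linear map: query q_j = W_Q s_j, shape (d_k,)
    scores = K @ q                 # inner products k_i^T q_j, shape (m,)
    alpha = softmax(scores)        # normalization: weights sum to 1
    c = alpha @ H                  # context vector c_j = sum_i alpha_i h_i
    z = np.concatenate([x_next, s_prev, c])
    s_next = np.tanh(A @ z + b)    # next decoder state s_{j+1}
    return s_next, alpha

# Toy usage: m = 5 source positions, small dimensions.
rng = np.random.default_rng(0)
m, d_h, d_s, d_x, d_k = 5, 8, 8, 6, 4
H = rng.normal(size=(m, d_h))
s_prev = rng.normal(size=d_s)
x_next = rng.normal(size=d_x)
W_K = rng.normal(size=(d_h, d_k))
W_Q = rng.normal(size=(d_k, d_s))
A = rng.normal(size=(d_s, d_x + d_s + d_h))
b = np.zeros(d_s)
s_next, alpha = attention_step(H, s_prev, x_next, W_K, W_Q, A, b)
print(alpha.round(3), alpha.sum())  # weights are nonnegative and sum to 1
```

Repeating this step for every one of the t target positions is what drives the cost up, as quantified next.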
Time complexity: computing one context vector requires m attention weights (m = source length); with t decoder steps, m·t weights are computed in total, so the cost is O(m·t), much higher than the standard Seq2Seq decoder.
Weights visualization: plotting the attention weights α_i for each decoder step shows which source words the decoder focuses on, i.e., an alignment between source and target words (Bahdanau et al., 2015).
Summary
Standard Seq2Seq model: the decoder looks only at its current state.
Attention: the decoder additionally looks at all the states of the encoder.
Attention: the decoder knows where to focus.
Downside: higher time complexity.
References:
Bahdanau, D., Cho, K., & Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, 2015.