Attention
Shortcoming of Seq2Seq
The encoder's final state is incapable of remembering a long input sequence.

Seq2Seq Model with Attention
Attention tremendously improves the Seq2Seq model.
With attention, the Seq2Seq model does not forget the source input.
With attention, the decoder knows where to focus.
Downside: much more computation.
Simple RNN + Attention
Attention weight: $\alpha_i = \mathrm{align}(h_i, s_0)$, for $i = 1$ to $m$, where $h_1, \dots, h_m$ are the encoder states and $s_0$ is the decoder's initial state.
There are two options for calculating the weights:
Option 1 (used in the original paper):
$\tilde{\alpha}_i = v^T \tanh\big(W \cdot [h_i; s_0]\big)$, for $i = 1$ to $m$.
Then normalize (so that the weights sum to 1):
$[\alpha_1, \dots, \alpha_m] = \mathrm{Softmax}\big([\tilde{\alpha}_1, \dots, \tilde{\alpha}_m]\big)$
Option 2 (more popular; the same as in the Transformer):
Linear maps:
$k_i = W_K \, h_i$, for $i = 1$ to $m$; and $q_0 = W_Q \, s_0$.
Inner product:
$\tilde{\alpha}_i = k_i^T q_0$, for $i = 1$ to $m$.
Normalization:
$[\alpha_1, \dots, \alpha_m] = \mathrm{Softmax}\big([\tilde{\alpha}_1, \dots, \tilde{\alpha}_m]\big)$
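Both options can be written in a few lines of NumPy. The sketch below is illustrative only: the shapes, the random toy inputs, and the parameter names (`W`, `v`, `W_K`, `W_Q`) are assumptions, not values from the lecture.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability; the result is positive and sums to 1.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def align_option1(h, s0, W, v):
    """Option 1 (original paper): score_i = v^T tanh(W [h_i; s_0])."""
    scores = np.array([v @ np.tanh(W @ np.concatenate([h_i, s0])) for h_i in h])
    return softmax(scores)                       # alpha_1, ..., alpha_m

def align_option2(h, s0, W_K, W_Q):
    """Option 2 (Transformer-style): k_i = W_K h_i, q_0 = W_Q s_0, score_i = k_i^T q_0."""
    k = h @ W_K.T                                # one key per encoder state, shape (m, d)
    q0 = W_Q @ s0                                # one query from the decoder state, shape (d,)
    return softmax(k @ q0)                       # alpha_1, ..., alpha_m

# Toy example: m = 4 encoder states of size 8, decoder state of size 8.
rng = np.random.default_rng(0)
h, s0 = rng.normal(size=(4, 8)), rng.normal(size=8)
W, v = rng.normal(size=(8, 16)), rng.normal(size=8)
W_K, W_Q = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
print(align_option1(h, s0, W, v).sum())          # ~1.0
print(align_option2(h, s0, W_K, W_Q).sum())      # ~1.0
```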
Calculate the next state
Simple RNN: $s_1 = \tanh\big(A' \cdot [x'_1; s_0] + b\big)$
Simple RNN + Attention: $s_1 = \tanh\big(A' \cdot [x'_1; s_0; c_0] + b\big)$
Context vector:
$c_0 = \alpha_1 h_1 + \dots + \alpha_m h_m$
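As a minimal sketch of one such decoder step (same toy NumPy notation as above; `A_prime`, `b`, and `x1` are hypothetical placeholders, not names from the lecture):

```python
import numpy as np

def rnn_step_with_attention(x1, s0, h, alpha, A_prime, b):
    """s_1 = tanh(A' [x'_1; s_0; c_0] + b), with c_0 = sum_i alpha_i * h_i."""
    c0 = alpha @ h                           # context vector: weighted average of encoder states
    z = np.concatenate([x1, s0, c0])         # concatenate input, previous state, and context
    return np.tanh(A_prime @ z + b)          # next decoder state s_1
```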
For the next state $s_2$, do not reuse the previously calculated weights $\alpha_1, \dots, \alpha_m$; recompute them from the current decoder state $s_1$.
Compute the parameters for state $s_1$:
Query: $q_1 = W_Q \, s_1$ (to match others)
Key: $k_i = W_K \, h_i$, for $i = 1$ to $m$ (to be matched)
Value: $h_1, \dots, h_m$ (to be weighted averaged)
Weights: $\alpha_i = \mathrm{align}(h_i, s_1)$, for $i = 1$ to $m$
Compute $q_1$ and $k_1, \dots, k_m$, then compute the weights:
$[\alpha_1, \dots, \alpha_m] = \mathrm{Softmax}\big([k_1^T q_1, \dots, k_m^T q_1]\big)$
Context vector:
$c_1 = \alpha_1 h_1 + \dots + \alpha_m h_m$
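Putting the pieces together, the sketch below (same assumed NumPy notation; not taken from the lecture) runs the whole decoder, recomputing the query, the weights, and the context vector at every step:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def decode(x_dec, s0, h, W_K, W_Q, A_prime, b):
    """x_dec: decoder inputs x'_1, ..., x'_t; h: encoder states h_1, ..., h_m."""
    k = h @ W_K.T                            # keys depend only on the encoder, so compute them once
    s, states = s0, []
    for x in x_dec:                          # one iteration per decoder step
        q = W_Q @ s                          # fresh query from the current decoder state
        alpha = softmax(k @ q)               # fresh weights alpha_1, ..., alpha_m (never reused)
        c = alpha @ h                        # fresh context vector
        s = np.tanh(A_prime @ np.concatenate([x, s, c]) + b)
        states.append(s)
    return np.stack(states)                  # all t decoder states
```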
Time complexity
Question: How many weights $\alpha_i$ have been computed in total?
• To compute one context vector $c_j$, we compute $m$ weights: $\alpha_1, \dots, \alpha_m$.
• The decoder has $t$ states, so there are $m \cdot t$ weights in total.
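As an illustrative example (numbers chosen arbitrarily): with $m = 30$ source tokens and $t = 20$ target tokens, attention computes $30 \times 20 = 600$ weights, whereas the standard Seq2Seq model performs only on the order of $m + t = 50$ recurrent steps.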
Weights Visualization
(Figure: heatmap of the attention weights between source words and target words; larger weights show where the decoder focuses.)
Summary
Standard Seq2Seq model: the decoder looks only at its current state.
Attention: the decoder additionally looks at all the states of the encoder.
Attention: the decoder knows where to focus.
Downside: higher time complexity.
$m$: source sequence length
$t$: target sequence length
Standard Seq2Seq: $O(m + t)$ time complexity
Seq2Seq + attention: $O(m \cdot t)$ time complexity
References:
Bahdanau, D., Cho, K., & Bengio, Y. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.