Attention
Shortcoming of Seq2Seq
The final state of the encoder is incapable of remembering a long source sequence.

Seq2Seq Model with Attention
- Attention tremendously improves the Seq2Seq model. 
- With attention, the Seq2Seq model does not forget the source input. 
- With attention, the decoder knows where to focus. 
- Downside: much more computation. 
Simple RNN + Attention

There are two options to calculate the attention weights $\alpha_i$. Here $\mathbf{h}_1, \ldots, \mathbf{h}_m$ are the encoder's hidden states and $\mathbf{s}_0$ is the decoder's initial state.

Option 1 (used in the original paper, Bahdanau et al., 2015):

$\tilde{\alpha}_i = \mathbf{v}^\top \tanh\big(\mathbf{W} \cdot [\mathbf{h}_i;\ \mathbf{s}_0]\big)$, for $i = 1$ to $m$

Then normalize (so that the weights sum to 1):

$[\alpha_1, \ldots, \alpha_m] = \mathrm{Softmax}\big([\tilde{\alpha}_1, \ldots, \tilde{\alpha}_m]\big)$

Option 2 (more popular; the same as in the Transformer):
- Linear maps: $\mathbf{k}_i = \mathbf{W}_K\, \mathbf{h}_i$, for $i = 1$ to $m$, and $\mathbf{q}_0 = \mathbf{W}_Q\, \mathbf{s}_0$ 
- Inner product: $\tilde{\alpha}_i = \mathbf{k}_i^\top \mathbf{q}_0$, for $i = 1$ to $m$ 
- Normalization: $[\alpha_1, \ldots, \alpha_m] = \mathrm{Softmax}\big([\tilde{\alpha}_1, \ldots, \tilde{\alpha}_m]\big)$ 
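
A minimal NumPy sketch of both options. The toy dimensions, the random parameters, and the names `h`, `s0`, `W`, `v`, `W_K`, `W_Q` are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)              # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum()

# Toy sizes: m encoder states h_1..h_m, hidden dimension d.
m, d = 6, 4
rng = np.random.default_rng(0)
h = rng.normal(size=(m, d))        # encoder states h_1..h_m, one per row
s0 = rng.normal(size=(d,))         # decoder's initial state s_0

# Option 1 (additive, Bahdanau-style): score_i = v^T tanh(W [h_i; s_0])
W = rng.normal(size=(d, 2 * d))
v = rng.normal(size=(d,))
scores = np.array([v @ np.tanh(W @ np.concatenate([h[i], s0])) for i in range(m)])
alpha1 = softmax(scores)

# Option 2 (dot-product, Transformer-style): k_i = W_K h_i, q_0 = W_Q s_0, score_i = k_i^T q_0
W_K = rng.normal(size=(d, d))
W_Q = rng.normal(size=(d, d))
k = h @ W_K.T                      # keys, shape (m, d)
q0 = W_Q @ s0                      # query, shape (d,)
alpha2 = softmax(k @ q0)

print(alpha1.sum(), alpha2.sum())  # both are valid weight vectors: each sums to 1
```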
Calculate the next state
Simple RNN: $\mathbf{s}_1 = \tanh\big(\mathbf{A}' \cdot [\mathbf{x}'_1;\ \mathbf{s}_0] + \mathbf{b}\big)$

Simple RNN + Attention: $\mathbf{s}_1 = \tanh\big(\mathbf{A}' \cdot [\mathbf{x}'_1;\ \mathbf{s}_0;\ \mathbf{c}_0] + \mathbf{b}\big)$

Context vector: $\mathbf{c}_0 = \alpha_1 \mathbf{h}_1 + \cdots + \alpha_m \mathbf{h}_m$

Here $\mathbf{x}'_1$ is the decoder's first input, and $\mathbf{A}'$ and $\mathbf{b}$ are the decoder's parameters.
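
Continuing the sketch above, one decoder step with attention might look like this; `A_dec`, `b`, and `x1` are hypothetical stand-ins for $\mathbf{A}'$, $\mathbf{b}$, and $\mathbf{x}'_1$:

```python
# One decoder step with attention (reuses h, s0, alpha2 from the previous snippet).
A_dec = rng.normal(size=(d, 3 * d))   # plays the role of A'; acts on [x'_1; s_0; c_0]
b = rng.normal(size=(d,))             # bias b
x1 = rng.normal(size=(d,))            # first decoder input x'_1 (e.g. start-token embedding)

alpha = alpha2                        # weights from either option
c0 = alpha @ h                        # context vector c_0 = sum_i alpha_i * h_i
s1 = np.tanh(A_dec @ np.concatenate([x1, s0, c0]) + b)   # Simple RNN + Attention

# A plain Simple RNN step would ignore c_0:
# s1 = np.tanh(A_plain @ np.concatenate([x1, s0]) + b)   # A_plain would have shape (d, 2*d)
```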

For the next state $\mathbf{s}_2$, do not reuse the previously calculated weights; recompute them from $\mathbf{s}_1$.

Compute the following for state $\mathbf{s}_1$ (see the sketch of the full decoder loop after this list):
- Query: $\mathbf{q}_1 = \mathbf{W}_Q\, \mathbf{s}_1$ (to match others) 
- Key: $\mathbf{k}_i = \mathbf{W}_K\, \mathbf{h}_i$, for $i = 1$ to $m$ (to be matched) 
- Value: $\mathbf{h}_i$, for $i = 1$ to $m$ (to be weighted averaged) 
- Weights: compute $\tilde{\alpha}_i = \mathbf{k}_i^\top \mathbf{q}_1$, for $i = 1$ to $m$, then normalize: $[\alpha_1, \ldots, \alpha_m] = \mathrm{Softmax}\big([\tilde{\alpha}_1, \ldots, \tilde{\alpha}_m]\big)$ 
- Context vector: $\mathbf{c}_1 = \alpha_1 \mathbf{h}_1 + \cdots + \alpha_m \mathbf{h}_m$ 

The next state is then $\mathbf{s}_2 = \tanh\big(\mathbf{A}' \cdot [\mathbf{x}'_2;\ \mathbf{s}_1;\ \mathbf{c}_1] + \mathbf{b}\big)$, and the same procedure repeats for every decoder state.
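Putting the pieces together, here is a sketch of the whole decoder loop, reusing `h`, `k`, `W_Q`, `A_dec`, and `b` from the snippets above; the number of steps and the decoder inputs are made-up toy values:

```python
# The attention weights are recomputed from scratch at every decoder step.
t_steps = 5                            # number of decoder steps (toy value)
x_dec = rng.normal(size=(t_steps, d))  # decoder inputs x'_1..x'_t (hypothetical)

s = s0
states = []
for j in range(t_steps):
    q = W_Q @ s                        # query from the current decoder state
    alpha = softmax(k @ q)             # m weights, recomputed for this step
    c = alpha @ h                      # context vector for this step
    s = np.tanh(A_dec @ np.concatenate([x_dec[j], s, c]) + b)
    states.append(s)

# t_steps decoder steps x m weights per step  ->  m * t weights in total.
```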

Time complexity
- Question: how many weights $\alpha_i$ have been computed in total? 
- To compute one context vector $\mathbf{c}_j$, we compute $m$ weights: $\alpha_1, \ldots, \alpha_m$. 
- The decoder has $t$ states, so in total $m \cdot t$ weights are computed. 
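
As a small worked example (the sentence lengths are arbitrary):

$$
m = 50,\quad t = 50 \quad\Longrightarrow\quad \text{total number of weights} = m \cdot t = 50 \times 50 = 2500 .
$$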
 
Weights Visualization

(Figure: heatmap of the attention weights $\alpha_{ij}$, showing which source words each target word attends to, i.e. the learned alignment; see Bahdanau et al., 2015.)
Summary
- Standard Seq2Seq model: the decoder looks only at its current state. 
- Attention: the decoder additionally looks at all the states of the encoder. 
- Attention: the decoder knows where to focus. 
- Downside: higher time complexity. 
  - $m$: source sequence length 
  - $t$: target sequence length 
  - Standard Seq2Seq: $O(m + t)$ time complexity 
  - Seq2Seq + attention: $O(m \cdot t)$ time complexity 
 
References:
- Bahdanau, D., Cho, K., & Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR, 2015. 