Attention without RNN
Compared with Simple RNN + Attention, this model does not use RNN hidden states to compute the keys, values, and queries; they are computed directly from the encoder’s and decoder’s inputs.
We study the Seq2Seq model (encoder + decoder).
Encoder’s inputs are vectors $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m$.
Decoder’s inputs are vectors $\mathbf{x}_1', \mathbf{x}_2', \ldots, \mathbf{x}_t'$.
Keys and values are based on encoder’s inputs $\mathbf{x}_1, \ldots, \mathbf{x}_m$.
Key: $\mathbf{k}_{:i} = \mathbf{W}_K \, \mathbf{x}_i$, for $i = 1, \ldots, m$.
Value: $\mathbf{v}_{:i} = \mathbf{W}_V \, \mathbf{x}_i$, for $i = 1, \ldots, m$.
Queries are based on decoder’s inputs $\mathbf{x}_1', \ldots, \mathbf{x}_t'$.
Query: $\mathbf{q}_{:j} = \mathbf{W}_Q \, \mathbf{x}_j'$, for $j = 1, \ldots, t$.
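As a concrete illustration, here is a minimal NumPy sketch of these three projections. The column-wise layout (inputs stacked as columns of $d \times m$ and $d \times t$ matrices), the sizes, and the names `W_K`, `W_V`, `W_Q`, `d_k` are assumptions made for the example, not fixed by the notes.

```python
import numpy as np

# illustrative sizes (assumptions): input dim d, key/query/value dim d_k,
# m encoder positions, t decoder positions
d, d_k, m, t = 8, 4, 5, 3
rng = np.random.default_rng(0)

X = rng.normal(size=(d, m))         # encoder inputs x_1, ..., x_m as columns
X_prime = rng.normal(size=(d, t))   # decoder inputs x'_1, ..., x'_t as columns

W_K = rng.normal(size=(d_k, d))     # key projection matrix
W_V = rng.normal(size=(d_k, d))     # value projection matrix
W_Q = rng.normal(size=(d_k, d))     # query projection matrix

K = W_K @ X         # column i is the key    k_{:i} = W_K x_i
V = W_V @ X         # column i is the value  v_{:i} = W_V x_i
Q = W_Q @ X_prime   # column j is the query  q_{:j} = W_Q x'_j
```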
Compute weights: $\boldsymbol{\alpha}_{:j} = \mathrm{Softmax}\big(\mathbf{K}^T \mathbf{q}_{:j}\big) \in \mathbb{R}^m$, where $\mathbf{K} = [\mathbf{k}_{:1}, \ldots, \mathbf{k}_{:m}]$.
Context vector: $\mathbf{c}_{:j} = \alpha_{1j}\,\mathbf{v}_{:1} + \cdots + \alpha_{mj}\,\mathbf{v}_{:m} = \mathbf{V}\,\boldsymbol{\alpha}_{:j}$, where $\mathbf{V} = [\mathbf{v}_{:1}, \ldots, \mathbf{v}_{:m}]$.
Thus, $\mathbf{c}_{:j}$ is a function of $\mathbf{x}_j'$ and $[\mathbf{x}_1, \ldots, \mathbf{x}_m]$.
Output of attention layer: $\mathbf{C} = [\mathbf{c}_{:1}, \mathbf{c}_{:2}, \ldots, \mathbf{c}_{:t}]$.
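Continuing the sketch above (same `K`, `V`, `Q` and shapes), the weights and context vectors can be computed one decoder position at a time; `softmax` is a small helper defined here for the example, not a NumPy function.

```python
def softmax(z):
    # numerically stable softmax over a 1-D vector
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

C = np.zeros((d_k, t))                 # output: one context vector per column
for j in range(t):
    alpha_j = softmax(K.T @ Q[:, j])   # alpha_{:j} = Softmax(K^T q_{:j}), in R^m
    C[:, j] = V @ alpha_j              # c_{:j} = V alpha_{:j}
```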
Attention layer: $\mathbf{C} = \mathrm{Attn}(\mathbf{X}, \mathbf{X}')$.
Encoder's inputs: $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_m]$.
Decoder's inputs: $\mathbf{X}' = [\mathbf{x}_1', \ldots, \mathbf{x}_t']$.
Parameters: $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V$.
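Putting the steps together, a possible sketch of the whole layer as one function `attn(X, X_prime, W_Q, W_K, W_V)`; the function name and the vectorized column-wise softmax are choices made for this example.

```python
def attn(X, X_prime, W_Q, W_K, W_V):
    """Sketch of C = Attn(X, X'): keys and values from the encoder inputs X,
    queries from the decoder inputs X', one context vector per decoder position."""
    K = W_K @ X                                           # keys,    d_k x m
    V = W_V @ X                                           # values,  d_k x m
    Q = W_Q @ X_prime                                     # queries, d_k x t
    scores = K.T @ Q                                      # entry (i, j) is k_{:i}^T q_{:j}
    scores = scores - scores.max(axis=0, keepdims=True)   # stabilize the softmax
    A = np.exp(scores)
    A = A / A.sum(axis=0, keepdims=True)                  # column j is alpha_{:j}
    return V @ A                                          # column j is c_{:j}
```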
Self-Attention layer: $\mathbf{C} = \mathrm{Attn}(\mathbf{X}, \mathbf{X})$.
Inputs: $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_m]$.
Parameters: $\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V$.
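With the sketch above, the only difference between the two layers is which matrix supplies the queries:

```python
C_cross = attn(X, X_prime, W_Q, W_K, W_V)   # attention layer:      C = Attn(X, X')
C_self  = attn(X, X,       W_Q, W_K, W_V)   # self-attention layer: C = Attn(X, X)
```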
Attention was originally developed for Seq2Seq RNN models [1].
Self-attention: attention for all RNN models (not necessarily Seq2Seq models) [2].
Attention can be used without RNN [3].
We learned how to build the attention layer and the self-attention layer.
Original paper: Vaswani et al. Attention Is All You Need. In NIPS, 2017.