Attention Layer: Attention without RNN
Compared with Simple RNN + Attention, this layer does not use the RNN's hidden states to compute the keys, values, and queries; it computes them directly from the inputs.
We study the Seq2Seq model (encoder + decoder).
Encoder's inputs are the vectors x_1, x_2, ..., x_m.
Decoder's inputs are the vectors x'_1, x'_2, ..., x'_t.
Keys and values are based on the encoder's inputs x_1, x_2, ..., x_m:
Key: k:i = W_K x_i
Value: v:i = W_V x_i
Queries are based on the decoder's inputs x'_1, x'_2, ..., x'_t:
Query: q:j = W_Q x'_j
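To make the notation concrete, here is a minimal NumPy sketch of the three projections. The dimension choices (d, d_k, m, t), the random inputs, and the variable names X_dec, W_K, W_V, W_Q are assumptions made for this example, not values from the lecture.

```python
import numpy as np

# Toy sizes, assumed only for this illustration.
d, d_k = 8, 4        # input dimension and key/query/value dimension
m, t = 5, 3          # number of encoder inputs and number of decoder inputs

rng = np.random.default_rng(0)
W_K = rng.normal(size=(d_k, d))   # key projection (a trainable parameter in practice)
W_V = rng.normal(size=(d_k, d))   # value projection
W_Q = rng.normal(size=(d_k, d))   # query projection

X     = rng.normal(size=(d, m))   # encoder inputs x_1, ..., x_m stored as columns
X_dec = rng.normal(size=(d, t))   # decoder inputs x'_1, ..., x'_t stored as columns

K = W_K @ X       # column i is k:i = W_K x_i
V = W_V @ X       # column i is v:i = W_V x_i
Q = W_Q @ X_dec   # column j is q:j = W_Q x'_j
```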
Compute weights: α:j = Softmax(K^T q:j) ∈ R^m
Context vector: c:j = α_1j v:1 + ... + α_mj v:m = V α:j
Thus, c:j is a function of x'_j and [x_1, ..., x_m].
Output of attention layer: C = [c:1, c:2, ..., c:t]
Attention layer summary
Attention layer: C = Attn(X, X')
Encoder's inputs: X = [x_1, x_2, ..., x_m]
Decoder's inputs: X' = [x'_1, x'_2, ..., x'_t]
Parameters: W_Q, W_K, W_V
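Putting the steps together, below is a minimal NumPy sketch of the whole layer as a single function, following the formulas above (no 1/sqrt(d_k) scaling, since the formula here has none). The names attn and softmax and the column-wise conventions are assumptions for this example, not from the lecture.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax: each column sums to 1."""
    Z = Z - Z.max(axis=0, keepdims=True)      # subtract the max for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)

def attn(X, X_dec, W_Q, W_K, W_V):
    """Attention layer C = Attn(X, X').

    X:     encoder inputs, shape (d, m), columns x_1, ..., x_m
    X_dec: decoder inputs, shape (d, t), columns x'_1, ..., x'_t
    Returns C of shape (d_k, t); column j is the context vector c:j.
    """
    K = W_K @ X                # keys,    one column per encoder input
    V = W_V @ X                # values,  one column per encoder input
    Q = W_Q @ X_dec            # queries, one column per decoder input
    A = softmax(K.T @ Q)       # column j is α:j = Softmax(K^T q:j) ∈ R^m
    return V @ A               # column j is c:j = V α:j
```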
Self-Attention layer summary
Self-Attention layer: C = Attn(X, X)
Inputs: X = [x_1, x_2, ..., x_m]
Parameters: W_Q, W_K, W_V
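Continuing the sketches above (reusing the assumed names attn, X, X_dec, W_Q, W_K, W_V defined there), self-attention is the same function with the same inputs in both slots:

```python
C_cross = attn(X, X_dec, W_Q, W_K, W_V)   # attention layer:      C = Attn(X, X'), shape (d_k, t)
C_self  = attn(X, X,     W_Q, W_K, W_V)   # self-attention layer: C = Attn(X, X),  shape (d_k, m)
```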
Summary
Attention was originally developed for Seq2Seq RNN models [1].
Self-attention: attention for all RNN models (not necessarily Seq2Seq models) [2].
Attention can be used without RNN [3].
We learned how to build the attention layer and the self-attention layer.
Reference
Original paper: Vaswani et al. Attention Is All You Need. In NIPS, 2017.