Attention Layer: Attention without RNN
Compared with Simple RNN + Attention, this layer does not use the RNN's hidden states to compute the keys, values, and queries; they are computed directly from the encoder's and decoder's inputs.
We study the Seq2Seq model (encoder + decoder).
Encoder's inputs are vectors x_1, x_2, ..., x_m.
Decoder's inputs are vectors x'_1, x'_2, ..., x'_t.
Keys and values are based on encoder's inputs x_1, x_2, ..., x_m.
Key: k_:i = W_K x_i
Value: v_:i = W_V x_i
Queries are based on decoder's inputs x'_1, x'_2, ..., x'_t.
Query: q_:j = W_Q x'_j
Stack the keys and values into matrices K = [k_:1, ..., k_:m] and V = [v_:1, ..., v_:m].
Compute weights: α_:j = Softmax(K^T q_:j) ∈ R^m
Context vector: c_:j = α_1j v_:1 + ... + α_mj v_:m = V α_:j
Thus, c_:j is a function of x'_j and [x_1, ..., x_m].
Output of attention layer: C = [c_:1, c_:2, c_:3, ..., c_:t]
Attention layer summary
Attention layer: C = Attn(X, X')
Encoder's inputs: X = [x_1, x_2, ..., x_m]
Decoder's inputs: X' = [x'_1, x'_2, ..., x'_t]
Parameters: W_Q, W_K, W_V
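Below is a minimal NumPy sketch of the attention layer just summarized, with inputs stored as columns as in the notes. The function name attn, the column-wise layout, and the toy dimensions are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def softmax_columns(Z):
    """Numerically stable softmax applied to each column of Z."""
    Z = Z - Z.max(axis=0, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)

def attn(X, X_prime, W_Q, W_K, W_V):
    """Attention layer C = Attn(X, X') with inputs stored as columns.

    X:       encoder inputs, shape (d_in, m)  -> columns x_1, ..., x_m
    X_prime: decoder inputs, shape (d_in, t)  -> columns x'_1, ..., x'_t
    Returns C with shape (d_out, t), columns c_:1, ..., c_:t.
    """
    K = W_K @ X                   # keys:    k_:i = W_K x_i,  shape (d_out, m)
    V = W_V @ X                   # values:  v_:i = W_V x_i,  shape (d_out, m)
    Q = W_Q @ X_prime             # queries: q_:j = W_Q x'_j, shape (d_out, t)
    A = softmax_columns(K.T @ Q)  # weights: α_:j = Softmax(K^T q_:j), shape (m, t)
    return V @ A                  # context vectors: c_:j = V α_:j

# Toy usage with random parameters and inputs (illustrative shapes only).
rng = np.random.default_rng(0)
d_in, d_out, m, t = 8, 4, 5, 3
W_Q, W_K, W_V = (rng.normal(size=(d_out, d_in)) for _ in range(3))
X = rng.normal(size=(d_in, m))         # encoder inputs
X_prime = rng.normal(size=(d_in, t))   # decoder inputs
C = attn(X, X_prime, W_Q, W_K, W_V)
print(C.shape)                         # (4, 3): one context vector per decoder input
```

Note that each column of C depends on one decoder input and on all encoder inputs, matching the statement that c_:j is a function of x'_j and [x_1, ..., x_m].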
Self-Attention layer summary
Self-Attention layer: C = Attn(X, X)
Inputs: X = [x_1, x_2, ..., x_m]
Parameters: W_Q, W_K, W_V
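As a quick check on this summary, the same hypothetical attn sketch from above can be reused for self-attention by passing the single input sequence as both arguments:

```python
# Self-attention: every input x_i attends to all inputs x_1, ..., x_m (including itself).
C_self = attn(X, X, W_Q, W_K, W_V)
print(C_self.shape)  # (4, 5): one context vector per input column
```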
Summary
Attention was originally developed for Seq2Seq RNN models [1].
Self-attention: attention for all RNN models (not necessarily Seq2Seq models) [2].
Attention can be used without RNN [3].
We learned how to build an attention layer and a self-attention layer.
Reference
Original paper: Vaswani et al. Attention Is All You Need. In NIPS, 2017.