Attention without RNN

Attention Layer: Attention without RNN

Compared with Simple RNN + Attention, the attention layer here does not use the RNN's hidden states to compute the Key, Value, and Query vectors; they are computed directly from the encoder's and decoder's inputs.

  • We study the Seq2Seq model (encoder + decoder).

  • Encoder’s inputs are vectors $x_1, x_2, \dots, x_m$.

  • Decoder’s inputs are vectors $x'_1, x'_2, \dots, x'_t$.

  • Keys and values are based on encoder’s inputs $x_1, x_2, \dots, x_m$.

    • Key: $k_{:i} = W_K x_i$

    • Value: $v_{:i} = W_V x_i$

  • Queries are based on decoder’s inputs $x'_1, x'_2, \dots, x'_t$.

    • Query: $q_{:j} = W_Q x'_j$

  • Compute weights: $\alpha_{:j} = \mathrm{Softmax}(K^T q_{:j}) \in \mathbb{R}^m$

  • Context vector: $c_{:j} = \alpha_{1j} v_{:1} + \dots + \alpha_{mj} v_{:m} = V \alpha_{:j}$

    • Thus, $c_{:j}$ is a function of $x'_j$ and $[x_1, \dots, x_m]$.

  • Output of attention layer: $C = [c_{:1}, c_{:2}, c_{:3}, \dots, c_{:t}]$ (see the NumPy sketch after this list).
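A minimal NumPy sketch of these steps, assuming each input $x_i$ (and $x'_j$) is a column vector so that $X$ is $d \times m$ and $X'$ is $d \times t$; the function name `attention` and all shapes are illustrative, not prescribed by the notes:

```python
import numpy as np

def softmax_columns(Z):
    # Numerically stable softmax applied to each column of Z.
    Z = Z - Z.max(axis=0, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=0, keepdims=True)

def attention(X, X_prime, W_Q, W_K, W_V):
    """Attention without RNN: returns C = Attn(X, X').

    X        : encoder inputs, shape (d, m), column i is x_i
    X_prime  : decoder inputs, shape (d, t), column j is x'_j
    W_Q, W_K : shape (d_k, d);  W_V : shape (d_v, d)
    """
    K = W_K @ X                   # keys:    k_{:i} = W_K x_i,   shape (d_k, m)
    V = W_V @ X                   # values:  v_{:i} = W_V x_i,   shape (d_v, m)
    Q = W_Q @ X_prime             # queries: q_{:j} = W_Q x'_j,  shape (d_k, t)
    A = softmax_columns(K.T @ Q)  # column j is alpha_{:j} = Softmax(K^T q_{:j}), shape (m, t)
    C = V @ A                     # column j is c_{:j} = V alpha_{:j}, shape (d_v, t)
    return C
```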

Attention layer summary

  • Attention layer: $C = \mathrm{Attn}(X, X')$ (a usage example follows this list).

    • Encoder's inputs: $X = [x_1, x_2, \dots, x_m]$

    • Decoder's inputs: $X' = [x'_1, x'_2, \dots, x'_t]$

    • Parameters: $W_Q, W_K, W_V$
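Continuing the sketch above, a hedged usage example (the dimensions and random inputs are purely illustrative):

```python
rng = np.random.default_rng(0)
d, m, t, d_k, d_v = 8, 6, 4, 5, 5          # illustrative dimensions
X       = rng.normal(size=(d, m))          # encoder inputs [x_1, ..., x_m]
X_prime = rng.normal(size=(d, t))          # decoder inputs [x'_1, ..., x'_t]
W_Q = rng.normal(size=(d_k, d))
W_K = rng.normal(size=(d_k, d))
W_V = rng.normal(size=(d_v, d))

C = attention(X, X_prime, W_Q, W_K, W_V)   # C = Attn(X, X'), shape (d_v, t)
```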

Self-Attention layer summary

  • Self-Attention layer: $C = \mathrm{Attn}(X, X)$ (see the one-line example after this list).

    • Inputs: $X = [x_1, x_2, \dots, x_m]$

    • Parameters: $W_Q, W_K, W_V$
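In the same sketch, self-attention is the same function applied with $X$ as both arguments:

```python
C_self = attention(X, X, W_Q, W_K, W_V)    # C = Attn(X, X), shape (d_v, m)
```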

Summary

  • Attention was originally developed for Seq2Seq RNN models [1].

  • Self-attention: attention for all kinds of RNN models (not necessarily Seq2Seq models) [2].

  • Attention can be used without RNN [3].

  • We learned how to build an attention layer and a self-attention layer.

Reference

  • Original paper: Vaswani et al. Attention Is All You Need. In NIPS, 2017.
