[Paper Review] Neural Machine Translation by Jointly Learning to Align and Translate (Attention, 2015)

2023, May 21    


Attention for RNN Encoder-Decoder Networks

 This paper proposes a novel approach called “attention” to improve the performance of machine translation with an encoder-decoder (Seq2Seq) architecture.

“Encoder-decoder” refers to a system in which an encoder compresses a source sentence into a fixed-length vector, from which a decoder generates the translation of that source sentence.

A basic encoder-decoder network performs poorly when translating long sentences, and this paper successfully mitigates the issue by introducing the concept of “attention”, which allows the model to automatically focus on the information relevant to the target word being predicted.


Issue of Interest

 The poor performance of the original encoder-decoder network stems mainly from the fact that the encoder must compress the source sentence, regardless of its original length, into a fixed-length vector.

The encoder takes a variable-length input and transforms it into a state of fixed shape, and the decoder maps that fixed-shape vector back into a variable-length translated output.

This is because, in a basic RNN encoder-decoder framework, the decoder uses a context vector as its initial input, and that vector is computed from the final hidden state of the encoder, which is a fixed-length vector.

This fixed-length context vector acts as an information bottleneck: as the source sentence grows longer, more information has to be squashed into the same fixed-length vector, which results in the loss of detailed or potentially important information from the original input.

This is shown in Figure 2 of the paper, where the BLEU score of the model with the basic encoder drops as the length of the source sentence increases.
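
To make the bottleneck concrete, here is a minimal NumPy sketch (illustrative only; the function and weight names are invented for the example) of a vanilla RNN encoder that folds the whole source sentence into one fixed-length vector, which is all the decoder would ever see.

```python
import numpy as np

def encode_fixed_length(source_embeddings, W_in, W_hh, hidden_size=4):
    """Run a plain RNN over the source and return only its final hidden state."""
    h = np.zeros(hidden_size)
    for x_t in source_embeddings:            # x_t: embedding of one source word
        h = np.tanh(W_in @ x_t + W_hh @ h)   # everything gets squashed into h
    return h                                 # fixed length, regardless of sentence length

# Hypothetical shapes, just for illustration
rng = np.random.default_rng(0)
emb_dim, hidden_size, sent_len = 3, 4, 7
W_in = rng.normal(size=(hidden_size, emb_dim))
W_hh = rng.normal(size=(hidden_size, hidden_size))
source = rng.normal(size=(sent_len, emb_dim))

context = encode_fixed_length(source, W_in, W_hh, hidden_size)
print(context.shape)                         # (4,) -- the decoder sees only this vector
```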


Model Architectures of BiRNN with Attention

 The most common encoder-decoder framework used in machine translation is the RNN. Below is the detailed architecture of the proposed attention-based RNN model (RNNsearch) used in the paper.

(image from https://www.youtube.com/watch?v=S2msiG9g7Us )


1. Encoder

 First, the model takes the source sentence as input, one word per time step, and computes its forward and backward hidden states.

  • Input (Source Sentence) & Output (Translation) :

   $\large x = (x_1, \ldots, x_{T_x}), \quad x_i \in \mathbb{R}^{K_x}$
   $\large y = (y_1, \ldots, y_{T_y}), \quad y_i \in \mathbb{R}^{K_y}$

   $T_{x}$ and $T_{y}$ respectively denote the lengths of source and target sentences.

  • Bidirectional RNN (BiRNN) Model :

    The forward states are computed with a gated recurrent unit (GRU), where $\sigma$ denotes the logistic sigmoid function:

    $\large \overrightarrow{h}_i = (1 - \overrightarrow{z}_i) \circ \overrightarrow{h}_{i-1} + \overrightarrow{z}_i \circ \overrightarrow{\underline{h}}_i$

    $\large \overrightarrow{\underline{h}}_i = \tanh\left(\overrightarrow{W} E x_i + \overrightarrow{U}\left[\overrightarrow{r}_i \circ \overrightarrow{h}_{i-1}\right]\right)$

    $\large \overrightarrow{z}_i = \sigma\left(\overrightarrow{W}_z E x_i + \overrightarrow{U}_z \overrightarrow{h}_{i-1}\right)$

    $\large \overrightarrow{r}_i = \sigma\left(\overrightarrow{W}_r E x_i + \overrightarrow{U}_r \overrightarrow{h}_{i-1}\right)$

    $\large E \in \mathbb{R}^{m \times K_x}$ : word embedding matrix

    $\large \overrightarrow{W}, \overrightarrow{W}_z, \overrightarrow{W}_r \in \mathbb{R}^{n \times m}$ : weight matrices, where m denotes the embedding dimensionality

    $\large \overrightarrow{U}, \overrightarrow{U}_z, \overrightarrow{U}_r \in \mathbb{R}^{n \times n}$ : weight matrices, where n denotes the number of hidden units

    Repeat the same steps backwards to obtain the backward states of the input (the embedding matrix is shared between the two directions, unlike the weight matrices), then concatenate the forward and backward states into one complete annotation for each word (a code sketch of the bidirectional encoder follows below):

    $\large h_i = \begin{bmatrix} \overrightarrow{h}_i & \overleftarrow{h}_i \end{bmatrix}^{\top}$
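
As mentioned above, here is a minimal NumPy sketch of the bidirectional encoder. It is illustrative only: it uses a plain tanh recurrence in place of the paper's GRU, assumes the inputs are already embedded word vectors ($E x_i$), and all function and weight names are invented for the example.

```python
import numpy as np

def rnn_pass(embedded, W, U, n):
    """One directional pass; returns the hidden state at every source position."""
    h = np.zeros(n)
    states = []
    for x in embedded:                       # x: embedding E x_i of one source word
        h = np.tanh(W @ x + U @ h)
        states.append(h)
    return np.stack(states)                  # shape (T_x, n)

def birnn_annotations(embedded, W_f, U_f, W_b, U_b, n):
    """Annotation h_i = [forward state; backward state] for every position i."""
    forward  = rnn_pass(embedded, W_f, U_f, n)               # left-to-right states
    backward = rnn_pass(embedded[::-1], W_b, U_b, n)[::-1]   # right-to-left, re-aligned
    return np.concatenate([forward, backward], axis=1)       # shape (T_x, 2n)
```

Because each annotation contains a forward and a backward state, it summarizes both the words preceding and the words following position $i$ in the source sentence.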


2. Decoder


Alignment Model


  • Additive Attention

    • $\large a(s_{i-1}, h_{j}) = v_{a}^{\top} \, \tanh\left(W_{a}\,s_{i-1} + U_{a}\,h_{j}\right)$


  • Dot-Product Attention (different from what is suggested in the paper, but more generally used)

    • $\large k_{i}\,=\, W_{K}\, h_{i}$ (for $i = 1$ to $m$, where m is the number of encoder annotations and $W_{K}\,\in\,\mathbb{R}^{n \times 2n}$)

    • $\large q_{j}\,=\, W_{Q}\, s_{j}$ ($W_{Q}\,\in\,\mathbb{R}^{n \times n}$)

    • Take the inner product of $k_{i}$ and $q_{j}$ and normalize it so that the $\alpha_{i}$ add up to 1
      • This searches for the positions ($i$) in the source sentence that are most relevant to the hidden state ($s_{j}$) of the word currently being predicted.
      • $\large \alpha_{i} = \mathrm{softmax}\left(k_{i}^{\top} q_{j}\right)$ (softmax taken over $i = 1, \ldots, m$)
      • $\large \alpha_{i}$ represents how much each hidden state of the source sentence contributes to predicting the translation in the decoder.
    • $\large \alpha_{i}$ = align($h_{i}$, $s_{j}$) = $\large \frac{\exp(k_i^\top q_j)}{\sum_{i'=1}^{m} \exp(k_{i'}^\top q_j)}$
  • Create Context Vector using $\large \alpha_{i}$

    • $\large c_{j}\,=\,\alpha_{1}\,h_{1} + \alpha_{2}\,h_{2} + \ldots + \alpha_{m}\,h_{m}$ = $\large \sum\limits_{i=1}^{m}\,\alpha_{ji}\,h_{i}$
  • Compute the hidden state $\large s_{i}$ of the decoder with the context vector $\large c_{i}$ (here, following the paper's notation, $i$ indexes the decoder step and $j$ the source position)

    • $\large s_i = (1 - z_i) \circ s_{i-1} + z_i \circ \tilde{s}_i$

    • $\large \tilde{s}_i = \tanh\left(W E y_{i-1} + U\left[r_i \circ s_{i-1}\right] + C c_i\right)$

    • $\large z_i = \sigma\left(W_z E y_{i-1} + U_z s_{i-1} + C_z c_i\right)$

    • $\large r_i = \sigma\left(W_r E y_{i-1} + U_r s_{i-1} + C_r c_i\right)$

      where $\large c_{i}\,=\, \sum\limits_{j=1}^{m}\,\alpha_{ij}\,h_{j}$ and $E$ here is the word embedding matrix of the target language

    • $W,\,W_{z},\,W_{r} \in \mathbb{R}^{n \times m}$,  $U,\,U_{z},\,U_{r} \in \mathbb{R}^{n \times n}$,  $C,\,C_{z},\,C_{r} \in \mathbb{R}^{n \times 2n}$ (a code sketch of one full decoder step follows below)
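
Below is a compact NumPy sketch of one decoder step with attention. It is illustrative only: the additive score follows the alignment model above, the dot-product score follows the variant above, the state update uses a single tanh cell in place of the paper's gated unit, and all function names, weight names, and sizes are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                           # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def score_additive(s_prev, H, W_a, U_a, v_a):
    """Additive alignment model: a(s_{i-1}, h_j) = v_a^T tanh(W_a s_{i-1} + U_a h_j)."""
    return np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])

def score_dot(s_prev, H, W_Q, W_K):
    """Dot-product variant: q = W_Q s_{i-1}, k_j = W_K h_j, score_j = k_j^T q."""
    q = W_Q @ s_prev
    return np.array([(W_K @ h_j) @ q for h_j in H])

def decoder_step(s_prev, y_prev_emb, H, p, use_dot=False):
    """One decoder update: scores -> attention weights -> context -> new state."""
    if use_dot:
        e = score_dot(s_prev, H, p["W_Q"], p["W_K"])
    else:
        e = score_additive(s_prev, H, p["W_a"], p["U_a"], p["v_a"])
    alpha = softmax(e)                        # attention weights over source positions
    c = alpha @ H                             # context vector: sum_j alpha_j * h_j
    # Simplified state update with a single tanh cell (the paper uses a gated unit
    # parameterized by W, U, C and the gate matrices W_z, U_z, C_z, W_r, U_r, C_r):
    s = np.tanh(p["W"] @ y_prev_emb + p["U"] @ s_prev + p["C"] @ c)
    return s, c, alpha

# Toy usage with hypothetical sizes: n hidden units, 2n-dimensional annotations,
# m-dimensional target embeddings, T_x source positions.
rng = np.random.default_rng(0)
n, m, T_x = 4, 3, 6
H = rng.normal(size=(T_x, 2 * n))             # encoder annotations h_1 .. h_{T_x}
p = {"W_a": rng.normal(size=(n, n)), "U_a": rng.normal(size=(n, 2 * n)),
     "v_a": rng.normal(size=n),
     "W_Q": rng.normal(size=(n, n)), "W_K": rng.normal(size=(n, 2 * n)),
     "W": rng.normal(size=(n, m)), "U": rng.normal(size=(n, n)),
     "C": rng.normal(size=(n, 2 * n))}
s, c, alpha = decoder_step(np.zeros(n), rng.normal(size=m), H, p)
print(alpha.round(3), alpha.sum())            # weights over the 6 source positions, summing to 1
```

Either scoring rule produces a weight vector that sums to 1 over the source positions, so the context vector is always a weighted combination of the encoder annotations, recomputed from scratch at every decoding step.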


BiRNN Encoder-Decoder with Attention Mechanism: Summary




 By introducing the attention mechanism into the basic RNN encoder-decoder framework, the limited translation performance on long sentences is addressed by allowing the decoder to dynamically search over different parts of the input sequence.

A context vector refined with attention frees the network from having to compress the whole source sentence equally into a fixed-length vector, and lets the model focus only on the input information relevant to generating the next target word.

