Transformer
Related article
Questions before reading
- How does multi-head attention work exactly ?
- How does self-attention work?
- The dot product between a query and each key gives the weight applied to the corresponding value.
- The time complexity per layer ?
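- $O(n^2\cdot d)$ for self-attention versus $O(n\cdot d^2)$ for a recurrent layer, so self-attention becomes expensive for very long texts, where the sequence length $n$ exceeds the representation dimension $d$.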
- Loss function ?
- How to process the variable-length information ?
- The computational resource that they used ?
- 3.5 days of training on eight NVIDIA P100 GPUs for the big model.
- Is it fast enough to serve as a model that works in real time?
term explanation
- BLEU[^1]
[^1]: wiki.en BLEU
- development set[^2]
[^2]: wiki validation set
- beam search
Basic Thought
Machine Translation
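- RNN: recurrent neural network
- long short-term memory (LSTM)
- gated recurrent neural network
- Drawback: the sequential computation precludes parallelization within a training example, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
- encoder-decoder architecture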
- factorization trick
- conditional computation
Attention mechanism
- modeling of dependencies without regard to their distance in the input or output sequences
- Transformer: eschewing recurrence and relying entirely on an attention mechanism
The techniques used ?
For example: knowledge distillation, data augmentation, ...
History
How did this field develop?
- wiki.en CNN
- Extended Neural GPU
- ByteNet
- ConvS2S
- → Multi-head attention: a constant number of operations suffices to learn dependencies between distant positions
- Self-attention
- End-to-end memory networks
model architecture
key points
- encoder
- the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn)
- input wiki.en embedding
- means
- word2vec
- ...
- to output query, key and value
- convert the input tokens and output tokens to vectors of dimension $d_{model}$
- shared weights between two embedding layers
- means
- multi-head attention
- residual connection
- layerNorm
- $LayerNorm(x + Sublayer(x))$
- position-wise fully connected feed-forward network
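- two linear transformations with a ReLU activation in between, applied to each position separately and identically: $FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$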
- layerNorm
- decoder
- Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time.
- Masked multi-head attention
- This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
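The $LayerNorm(x + Sublayer(x))$ wrapper listed among the encoder key points can be sketched as follows. This is a minimal illustration in plain Python for a single position vector, assuming a layer norm without the learned gain and bias; `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network, not something from the paper.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance.
    Assumption: the learned gain and bias are omitted for brevity."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for a single position vector x."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(x):
    """Hypothetical sub-layer used only for illustration."""
    return [2.0 * v for v in x]

out = residual_block([1.0, 2.0, 3.0, 4.0], toy_sublayer)
```

The sub-layer output must have the same dimension as its input for the residual sum to be defined, which is why every sub-layer in the model produces vectors of dimension $d_{model}$.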
Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.
Additive attention vs. dot-product attention: dot-product attention is faster and more space-efficient in practice because it can be implemented using highly optimized matrix multiplication code.
Scaled Dot-Product Attention: queries and keys of dimension $d_k$, and values of dimension $d_v$
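As a rough sketch of scaled dot-product attention, the paper's formula $softmax(QK^T/\sqrt{d_k})V$ can be written in dependency-free Python as below. The function names and the toy inputs are illustrative assumptions; a real implementation would use batched matrix multiplication rather than nested lists.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query: softmax(q . k / sqrt(d_k)) weighted sum of the values."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Compatibility of the query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Output is the weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(d_v)])
    return outputs

# Toy example: the query matches the first key more closely,
# so the first value receives the larger weight.
output = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

The division by $\sqrt{d_k}$ keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.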
Multi-head attention
Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality
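As a hedged summary of the cost statement above, the multi-head formulation from the paper can be written as follows. The projection matrices $W_i^Q$, $W_i^K$, $W_i^V$ and $W^O$ are symbols from the paper that these notes do not otherwise introduce.

$$
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\ KW_i^K,\ VW_i^V)
$$

With $h = 8$ heads and $d_k = d_v = d_{model}/h = 64$, the heads together span $h \cdot d_k = 512 = d_{model}$ dimensions, which is why the total cost stays close to that of a single head with full dimensionality.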
Application of attention
- encoder-decoder attention layers: queries come from the previous decoder layer; memory keys and values come from the output of the encoder
- self-attention layers let each position in the encoder attend to all positions in the encoder, and each position in the decoder attend to all decoder positions up to and including itself
- We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.
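The masking step described in the last bullet can be sketched as below: a lower-triangular mask marks the legal connections, and illegal scores are set to −∞ before the softmax so they receive zero weight. The helper names and the toy score matrix are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores, with disallowed entries set to -inf."""
    masked = [s if ok else NEG_INF for s, ok in zip(scores, allowed)]
    m = max(masked)  # finite, because each position may attend to itself
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Toy attention scores for a sequence of three decoder positions.
scores = [
    [0.5, 1.0, -0.3],
    [0.2, 0.1, 0.9],
    [1.5, -1.0, 0.0],
]
mask = causal_mask(3)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
```

Each row of `weights` still sums to one, but every entry to the right of the diagonal is zero, so position i never receives information from later positions.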
Position-wise Feed-forward Network
Position Encoding
hyper-parameter
Multi-head: $h = 8$ heads; per-head dimensions $d_k = d_v = d_{model}/h = 64$; $d_{ff} = 2048$ for the inner layer of the fully connected feed-forward network
We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
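A minimal sketch of the sinusoidal encoding mentioned above, following the paper's definition $PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}})$ and $PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}})$. The function name and the small sizes in the usage line are illustrative assumptions.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # even_dim plays the role of 2i in the paper's formula.
        for even_dim in range(0, d_model, 2):
            angle = pos / (10000 ** (even_dim / d_model))
            pe[pos][even_dim] = math.sin(angle)
            if even_dim + 1 < d_model:
                pe[pos][even_dim + 1] = math.cos(angle)
    return pe

# Small table for illustration; the paper's base model uses d_model = 512.
pe = positional_encoding(max_len=50, d_model=8)
```

Because the table is a fixed function of the position, it can be evaluated for positions beyond those seen in training, which is the extrapolation property the quoted sentence refers to.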
experiment
- WMT 2014 English-German D-datasets machine translation
- WMT 2014 English-to-French D-datasets machine translation
- one machine with 8 NVIDIA P100 GPUs
- base model
- each training step took about 0.4 seconds
- 100,000 steps or 12 hours
- big model
- step time was 1.0 seconds
- trained for 300,000 steps (3.5 days)
- Adam optimizer (method)
- $\beta_1=0.9$, $\beta_2=0.98$, $\epsilon=10^{-9}$
- varied learning rate (method)
- regularization (method)
- residual dropout 0.1
- label smoothing 0.1
- beam search (method)
- beam size 4
- length penalty 0.6
- averaging the last x checkpoints (method)
- development set newstest2013 D-datasets
- We set the maximum output length during inference to input length + 50, but terminate early when possible
- the number of attention heads matters: quality drops with a single head and also with too many heads
- reducing attention key size $d_k$ hurts model quality
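The varied learning rate listed in the experiment notes can be sketched as follows. The formula $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$ and the value `warmup_steps = 4000` come from the paper rather than from these notes, and the function name is an assumption.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate that rises linearly during warmup, then decays
    proportionally to the inverse square root of the step number."""
    step = max(step, 1)  # guard against step 0, where step ** -0.5 is undefined
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Sample the schedule at a few training steps.
schedule = {s: transformer_lr(s) for s in (1, 1000, 4000, 100000)}
```

The two arguments of `min` coincide at `step == warmup_steps`, so the rate peaks there and decreases afterwards.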
English Constituency Parsing
advantage and disadvantage of the model
Why Self-Attention ?
- total computational complexity per layer
- a self-attention layer needs $O(1)$ sequential operations, while a recurrent layer needs $O(n)$
- the amount of computation that can be parallelized
- sequential operation
- path length between long-range dependencies in the network
- self-attention O(1)
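To make the per-layer complexity comparison concrete, the sketch below evaluates the two leading-order terms, $n^2 \cdot d$ for self-attention and $n \cdot d^2$ for a recurrent layer. Constant factors are ignored, and the value `d = 512` and the sequence lengths are illustrative assumptions.

```python
def self_attention_ops(n, d):
    """Leading-order cost of one self-attention layer: n^2 * d."""
    return n * n * d

def recurrent_ops(n, d):
    """Leading-order cost of one recurrent layer: n * d^2."""
    return n * d * d

d = 512
ratios = {}
for n in (64, 512, 4096):
    # The ratio simplifies to n / d, so it crosses 1 when n equals d.
    ratios[n] = self_attention_ops(n, d) / recurrent_ops(n, d)
    print(f"n={n:5d}  self-attention / recurrent = {ratios[n]:.3f}")
```

Self-attention is cheaper per layer while the sequence length $n$ stays below the representation dimension $d$, and it keeps its $O(1)$ sequential operations and path length in either case.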
wiki.en sentence representation: byte-pair, word-piece
Article Writing
Good Expression
- To the best of our knowledge
- dispense
- eschew
- as a side benefit
- yield
- self-attention could yield more interpretable models