Transformer

2024/04/24 00:00:00·2026/05/19 10:23:00

AI Architecture · 12 min read

Transformer, attention mechanism, deep learning, NLP

BERT
ViT

Questions before reading

How exactly does multi-head attention work?
How does self-attention work?
1. The dot product between a query and each key gives the weight applied to the corresponding value.
What is the time complexity per layer?
1. $O(n^2\cdot d)$ for self-attention versus $O(n\cdot d^2)$ for a recurrent layer, so the cost grows quadratically with sequence length and very long texts become expensive to process.
What is the loss function?
How is variable-length input handled?
What computational resources did they use?
1. 3.5 days of training on eight GPUs (big model).
Can it work if we need a model that runs in real time?

term explanation

BLEU[^1]
[^1]: wiki.en BLEU

development set[^2]
[^2]: wiki validation set

beam search

Basic Thought

Machine Translation

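RNN: recurrent neural network
- long short-term memory (LSTM)
- gated recurrent neural network
Drawback: the sequential computation precludes parallelization within a training example, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.
encoder-decoder architecture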
factorization trick
conditional computation

Attention mechanism

modeling of dependencies without regard to their distance in the input or output sequences
Transformer: eschewing recurrence and relying entirely on an attention mechanism

The techniques used ?

*for example, knowledge distillation, data augmentation...

History

How did this field develop?

wiki.en CNN
- Extended Neural GPU
- ByteNet
- ConvS2S
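--> In these convolutional models the number of operations needed to relate two positions grows with their distance. The Transformer reduces this to a constant number of operations, and uses Multi-Head Attention to counteract the reduced effective resolution caused by averaging attention-weighted positions.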
Self-attention
End-to-end memory networks

model architecture

300 # architecture

key points

encoder
- the encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn)
- input wiki.en embedding
  - means
    - word2vec
    - ...
  - to output query, key and value
  - convert the input tokens and output tokens to vectors of dimension $d_{model}$
  - shared weights between two embedding layers
- multi-head attention
  - residual connection
- layerNorm
  - $$LayerNorm(x + Sublayer(x))$$
- position-wise fully connected feed-forward network
  - two linear transformations with a ReLU activation in between, applied to each position separately and identically
- layerNorm
decoder
- Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time.
- Masked multi-head attention
  - This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
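
The `LayerNorm(x + Sublayer(x))` pattern from the key points above can be sketched in NumPy as follows. This is only an illustration: `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sub-layer, and the `gain`, `bias`, and `eps` parameters are assumptions about a typical layer normalization, not details taken from these notes.

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-6):
    # Normalize each position (row) over its feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

def add_and_norm(x, sublayer, gain, bias):
    # LayerNorm(x + Sublayer(x)): residual connection first, then normalization.
    return layer_norm(x + sublayer(x), gain, bias)

rng = np.random.default_rng(0)
d_model = 8  # small toy size for readability
x = rng.normal(size=(3, d_model))
w = rng.normal(size=(d_model, d_model))
toy_sublayer = lambda t: t @ w  # hypothetical sub-layer, for illustration only

out = add_and_norm(x, toy_sublayer, gain=np.ones(d_model), bias=np.zeros(d_model))
print(out.mean(axis=-1), out.std(axis=-1))
```

With unit gain and zero bias, every row of `out` should have a mean close to 0 and a standard deviation close to 1, whatever the sub-layer does.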

Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

Pasted image 20240331113114.png

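additive attention vs. dot-product attention: the two are similar in theoretical complexity, but dot-product attention is faster and more space-efficient in practice because it can be implemented using highly optimized matrix multiplication code.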

Scaled Dot-Product Attention: queries and keys of dimension $d_k$, and values of dimension $d_v$
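
The paper defines this as $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T/\sqrt{d_k})\,V$. A minimal NumPy sketch of that formula, for a single unbatched sequence, might look like this. The function names and toy shapes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # (n_q, n_k): query-key similarity
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))  # 3 queries, d_k = 4
k = rng.normal(size=(5, 4))  # 5 keys
v = rng.normal(size=(5, 2))  # 5 values, d_v = 2
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.sum(axis=-1))
```

Each output row is a weighted average of the value rows, which is why the weights in every row sum to 1.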

Multi-head attention

Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality
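
One way to see why the cost stays similar: the $d_{model}$ features are split into $h$ slices of size $d_{model}/h$, and each head attends within its own slice. The sketch below assumes self-attention on one unbatched sequence and stacks the per-head projections into single `(d_model, d_model)` matrices, which is a common implementation choice rather than something stated in these notes. All names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, h):
    """x: (n, d_model); w_q, w_k, w_v, w_o: (d_model, d_model)."""
    n, d_model = x.shape
    d_k = d_model // h  # per-head dimension (assumes h divides d_model)

    def split_heads(t):
        # (n, d_model) -> (h, n, d_k)
        return t.reshape(n, h, d_k).transpose(1, 0, 2)

    q, k, v = split_heads(x @ w_q), split_heads(x @ w_k), split_heads(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n)
    heads = softmax(scores, axis=-1) @ v              # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o                               # final output projection

rng = np.random.default_rng(0)
n, d_model, h = 6, 512, 8
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) * 0.02 for _ in range(4))
y = multi_head_self_attention(x, w_q, w_k, w_v, w_o, h)
print(y.shape)
```

The score tensor has shape `(h, n, n)` with inner dimension `d_k = d_model / h`, so the total multiply count matches a single head over the full `d_model`.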

Application of attention

encoder-decoder attention layer, queries from previous decoder layer, memory keys and values from encoder
self-attention layers let each position in the encoder attend to all positions in the encoder, including itself; in the decoder, each position attends to all positions up to and including itself
We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.
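
The masking step can be sketched as follows. The mask lets position `i` attend to positions `j <= i`; together with the one-position offset of the output embeddings, this means a prediction never sees the token it is predicting. The helper names are assumptions for illustration.

```python
import numpy as np

def causal_mask(n):
    # True where attention is allowed: row i may look at columns j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    # Illegal connections are set to -inf before the softmax, so they get weight 0.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # toy scores: all positions equally similar
weights = masked_softmax(scores, causal_mask(4))
print(weights)
```

With all-equal scores, row `i` spreads its weight uniformly over the `i + 1` allowed positions and puts exactly zero on every later position.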

Position-wise Feed-forward Network
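
In the paper this sub-layer is $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$, with the same weights applied at every position. A minimal NumPy sketch, with randomly initialized weights standing in for learned ones (an assumption for illustration):

```python
import numpy as np

def position_wise_ffn(x, w1, b1, w2, b2):
    """x: (n, d_model) -> (n, d_model); the same weights are used at every position."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # ReLU, shape (n, d_ff)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 512, 2048
x = rng.normal(size=(n, d_model))
w1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

y = position_wise_ffn(x, w1, b1, w2, b2)
print(y.shape)
```

"Position-wise" means that running the function on a single row gives the same result as taking that row from the full output: positions do not interact here, only in the attention sub-layers.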

Position Encoding

$$PE_{(pos, 2i)}=\sin(pos/10000^{2i/d_{model}})$$

$$PE_{(pos, 2i+1)}=\cos(pos/10000^{2i/d_{model}})$$

where $pos$ is the position and $i$ is the dimension index.
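
A small NumPy sketch of the sinusoidal encoding, using the base 10000 from the original paper and assuming an even $d_{model}$. The function name and arguments are illustrative.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal position encodings."""
    pos = np.arange(max_len)[:, None]     # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]  # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)  # odd dimensions 2i + 1
    return pe

pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)
```

The resulting matrix is added to the token embeddings, so it has the same width $d_{model}$ as they do.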

hyper-parameter

Multi-head: $h = 8$; dimensions: $d_k = d_v = d_{model}/h = 64$ (with $d_{model} = 512$), $d_{ff} = 2048$ for the fully connected layer

We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

experiment

WMT 2014 English-German D-datasets machine translation
WMT 2014 English-to-French D-datasets machine translation
one machine with 8 NVIDIA P100 GPUs
base model
- each training step took about 0.4 seconds
- 100,000 steps or 12 hours
big model
- step time was 1.0 seconds
- trained for 300,000 steps (3.5 days)
Adam optimizer (method)
- $\beta_1=0.9$ , $\beta_2=0.98$ , $\epsilon=10^{-9}$
- varied learning rate (method)
regularization (method)
- residual dropout 0.1
- label smoothing 0.1
beam search (method)
- beam size 4
- length penalty 0.6
averaging the last x checkpoints (method)
development set newstest2013 D-datasets
We set the maximum output length during inference to input length + 50, but terminate early when possible
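
The "varied learning rate" in the list above refers to the paper's schedule, $lrate = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup\_steps^{-1.5})$ with $warmup\_steps = 4000$: a linear warm-up followed by inverse-square-root decay. A plain-Python sketch (the function name and the clamp for step 0 are assumptions for illustration):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (steps are counted from 1)."""
    step = max(step, 1)  # guard against step 0, where step ** -0.5 is undefined
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 20000, 100000):
    print(s, transformer_lr(s))
```

The rate peaks at `step == warmup_steps`, where the two arguments of `min` are equal.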

proper number of attention heads
reducing attention key size $d_k$ hurts model quality

English Constituency Parsing

advantage and disadvantage of the model

Why Self-Attention ?

Pasted image 20240331131256.png

total computational complexity per layer
1. self-attention $O(n^2\cdot d)$, recurrent $O(n\cdot d^2)$
the amount of computation that can be parallelized
1. measured by the minimum number of sequential operations: self-attention $O(1)$, recurrent $O(n)$
path length between long-range dependencies in the network
1. self-attention $O(1)$, recurrent $O(n)$
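
To make the per-layer complexity comparison concrete, the snippet below plugs numbers into $n^2 \cdot d$ (self-attention) and $n \cdot d^2$ (recurrent). It ignores constant factors, so treat it as a rough order-of-magnitude illustration; the function names and sample sizes are assumptions.

```python
def self_attention_ops(n, d):
    # Every position attends to every other position: n^2 pairs, d work each.
    return n * n * d

def recurrent_ops(n, d):
    # One d x d matrix-vector product per position.
    return n * d * d

d = 512
for n in (50, 512, 4096):
    print(n, self_attention_ops(n, d), recurrent_ops(n, d))
```

Self-attention is cheaper while the sequence length `n` is smaller than the representation size `d`, the two are equal at `n == d`, and the quadratic term dominates once `n` exceeds `d`.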

wiki.en sentence representation: byte-pair, word-piece

Article Writing

Good Expression

To the best of our knowledge
dispense
eschew
as side benefit
yield
- self-attention could yield more interpretable models