Transformer

Related article

Questions before reading

  1. How exactly does multi-head attention work?
  2. How does self-attention work?
    1. The dot product between a query and each key gives, after a softmax, the weight applied to the corresponding value.
  3. What is the time complexity per layer?
    1. Pasted image 20240331131256.png
    2. Self-attention costs $O(n^2\cdot d)$ per layer, while a recurrent layer costs $O(n\cdot d^2)$, so the cost of the Transformer grows quadratically with sequence length and very long texts become expensive to process.
  4. What is the loss function?
  5. How is variable-length input handled?
  6. What computational resources did they use?
    1. 3.5 days of training on eight GPUs.
  7. Can it be used when the model has to run in real time?

term explanation

Basic Thought

Machine Translation

Attention mechanism

The techniques used ?

*for example, knowledge distillation, data augmentation...

History

How did this field develop?

model architecture

[Figure: model architecture]

key points

Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

Pasted image 20240331113114.png

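Additive attention vs. dot-product attention: the two are similar in theoretical complexity, but dot-product attention is faster and more space-efficient in practice because it can be implemented using highly optimized matrix multiplication code.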

Scaled Dot-Product Attention: queries and keys have dimension $d_k$, and values have dimension $d_v$.
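
To make the definition concrete, here is a minimal sketch of scaled dot-product attention, $\mathrm{softmax}(QK^T/\sqrt{d_k})V$, written with plain Python lists for readability. The function names and the toy inputs are my own choices for illustration; a real implementation would use batched matrix multiplication.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """queries: n_q x d_k, keys: n_k x d_k, values: n_k x d_v (lists of lists)."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Compatibility of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # The output is the weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs


# Toy example (hypothetical numbers): one query, two key-value pairs.
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = scaled_dot_product_attention(q, k, v)
```

The query matches the first key better than the second, so the output lies closer to the first value vector than to the second.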

Multi-head attention

Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality
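
A rough count shows why the cost stays similar. The sketch below counts only the multiply-adds for the score matrix and the weighted sum over values, ignoring the linear projections; the function name and the sequence length are hypothetical, and $d_{model}=512$, $h=8$ are the base configuration of the paper.

```python
def attention_core_cost(n, d_head, num_heads):
    """Multiply-adds for QK^T and for the weighted sum over V, over all heads."""
    score_cost = n * n * d_head  # one n x n score matrix per head
    value_cost = n * n * d_head  # weighted sum of values per head
    return num_heads * (score_cost + value_cost)


d_model, h, n = 512, 8, 100  # n is an arbitrary example sequence length

single = attention_core_cost(n, d_model, 1)      # one head, full dimension
multi = attention_core_cost(n, d_model // h, h)  # h heads, d_k = d_v = 64
```

Under these assumptions `single` and `multi` come out equal; the extra input and output projections of multi-head attention are why the paper says "similar" rather than "identical".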

Application of attention

  1. Encoder-decoder attention layers: the queries come from the previous decoder layer, and the memory keys and values come from the output of the encoder.
  2. Self-attention layers let each position in the encoder (or decoder) attend to all positions in the encoder (or decoder), including its own position.
  3. We need to prevent leftward information flow in the decoder to preserve the auto-regressive property. We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.
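
The masking in item 3 can be sketched as follows. This is a toy illustration in plain Python, with uniform raw scores chosen by me; the helper names are hypothetical.

```python
import math


def causal_mask_scores(scores):
    """Set scores[i][j] to -inf for j > i, so position i cannot see later positions."""
    n = len(scores)
    return [
        [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]


def softmax(xs):
    """Numerically stable softmax; exp(-inf) evaluates to 0.0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3 x 3 score matrix where every raw score is equal.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = [softmax(row) for row in causal_mask_scores(scores)]
```

After masking, row `i` of `weights` puts zero weight on every position to the right of `i` and spreads the rest evenly over positions `0..i`. The diagonal is never masked, so each row always has at least one finite score.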

Position-wise Feed-forward Network

Position Encoding

$$PE_{(pos, 2i)}=\sin(pos/10000^{2i/d_{model}})$$
$$PE_{(pos, 2i+1)}=\cos(pos/10000^{2i/d_{model}})$$
$pos$ is the position and $i$ is the dimension index.
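
A minimal sketch of the sinusoidal encoding, using the base 10000 from the original Transformer paper. The function name and the `base` parameter are my own; even dimensions get the sine and the following odd dimension gets the cosine of the same angle.

```python
import math


def positional_encoding(max_len, d_model, base=10000.0):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # two_i plays the role of 2i in the formula.
        for two_i in range(0, d_model, 2):
            angle = pos / (base ** (two_i / d_model))
            pe[pos][two_i] = math.sin(angle)
            if two_i + 1 < d_model:
                pe[pos][two_i + 1] = math.cos(angle)
    return pe


pe = positional_encoding(max_len=50, d_model=8)
```

Because the table depends only on `pos` and the dimension index, it can be computed for any position, including positions beyond those seen in training.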

hyper-parameter

Number of heads: $h = 8$. Dimensions: $d_k = d_v = d_{model}/h = 64$ (with $d_{model} = 512$), and $d_{ff} = 2048$ for the inner layer of the position-wise feed-forward network.

We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

experiment

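  1. The number of attention heads matters: single-head attention is 0.9 BLEU worse than the best setting, and quality also drops with too many heads.
  2. Reducing the attention key size $d_k$ hurts model quality.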

English Constituency Parsing

advantage and disadvantage of the model

Why Self-Attention ?

Pasted image 20240331131256.png

  1. total computational complexity per layer
    1. self-attention $O(n^2\cdot d)$, while recurrent $O(n\cdot d^2)$
  2. the amount of computation that can be parallelized, measured by the minimum number of sequential operations
    1. self-attention $O(1)$ sequential operations, while recurrent $O(n)$
  3. path length between long-range dependencies in the network
    1. self-attention $O(1)$, while recurrent $O(n)$
wiki.en sentence representation: byte-pair, word-piece

Article Writing

Good Expression