[Paper Review] Transformer: Attention Is All You Need

서히! 2025. 1. 24. 16:33

Always keep in mind: what the paper sets out to achieve, its key elements, and the references worth following up

Paper Summary

Task addressed by the paper

Limitations of prior work

Framework

Experiments and results

Analysis

Conclusion

Paper Notes

0. Abstract

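Until now, the dominant sequence transduction models have been complex recurrent or convolutional neural networks with an encoder and a decoder; the best-performing ones also connect the encoder and decoder through an attention mechanism
→ The paper proposes the Transformer, which is based solely on attention mechanisms
On two machine translation tasks, the model is superior in quality while being more parallelizable and requiring significantly less time to train.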

1. Introduction

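Recurrent neural networks, long short-term memory, and gated recurrent neural networks have been firmly established as SOTA in sequence modeling and transduction
Recurrent models typically factor computation along the symbol positions of the input and output sequences
They generate a hidden state $h_t$ as a function of the previous hidden state $h_{t-1}$ and the input at position $t$
→ This inherently sequential computation prevents parallelization within a training example, and for longer sequences memory constraints limit batching across examples
→ Factorization tricks [1] and conditional computation have improved computational efficiency considerably, but the fundamental constraint of sequential computation remains
Attention makes it possible to model dependencies regardless of their distance in the input or output sequences, but in all but a few cases attention mechanisms are used together with a recurrent network
⇒ The Transformer drops recurrence and relies entirely on an attention mechanism to draw global dependencies between input and output
⇒ It allows significantly more parallelization and can reach SOTA translation quality after as little as 12 hours of training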

2. Background

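Extended Neural GPU, ByteNet, and ConvS2S all use convolutional neural networks as their basic building block and compute hidden representations in parallel for all input and output positions
→ The number of operations needed to relate signals from two arbitrary input or output positions grows with the distance between them: linearly for ConvS2S and logarithmically for ByteNet
⇒ In the Transformer this is reduced to a constant number of operations, at the cost of reduced effective resolution from averaging attention-weighted positions; Multi-Head Attention counteracts this effect
Self-attention (= intra-attention): an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and perform well on simple-language question answering and language modeling
→ The Transformer relies entirely on self-attention, without sequence-aligned RNNs or convolution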

3. Model Architecture

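Most competitive sequence transduction models have an encoder-decoder structure
- The encoder maps an input sequence $(x_1, ..., x_n)$ to a sequence of continuous representations $z = (z_1, ..., z_n)$
- Given $z$, the decoder generates an output sequence $(y_1, ..., y_m)$ one symbol at a time
  → The model is auto-regressive: it consumes the previously generated symbols as additional input when generating the next one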

3.1 Encoder and Decoder Stacks

Encoder
- Composed of 6 identical layers, each with two sub-layers
  - The first sub-layer is a multi-head self-attention mechanism
  - The second sub-layer is a position-wise fully connected feed-forward network
    → A residual connection [2] is applied around each sub-layer, followed by layer normalization, i.e. $LayerNorm(x + Sublayer(x))$
    → To make the residual connections possible, all sub-layers and the embedding layers produce outputs of dimension $d_{model} = 512$
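
The sub-layer wrapping described above, LayerNorm(x + Sublayer(x)), can be sketched in NumPy. This is a minimal sketch and not the authors' code: the learnable gain and bias of layer normalization are left out, `eps` is an assumed constant, and `toy_sublayer` is a hypothetical stand-in for the attention or feed-forward sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position (row) over the feature dimension.
    # Learnable gain and bias are omitted in this sketch.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_block(x, sublayer):
    # Post-norm arrangement: LayerNorm(x + Sublayer(x)).
    # The sum only works if sublayer(x) has the same shape as x,
    # which is why every sub-layer outputs d_model features.
    return layer_norm(x + sublayer(x))

def toy_sublayer(h):
    # Hypothetical stand-in for attention or the feed-forward network.
    return 0.1 * h

d_model = 512
rng = np.random.default_rng(0)
x = rng.normal(size=(10, d_model))   # 10 positions, d_model features each

out = residual_block(x, toy_sublayer)
print(out.shape)
```

The shape is unchanged by the block, so six of these layers can be stacked without any extra projection in between.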
Decoder
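- Also a stack of 6 identical layers; in addition to the two sub-layers of each encoder layer, a third sub-layer performs multi-head attention over the output of the encoder stack
- The self-attention sub-layer is modified (masked) → the prediction for position i can depend only on the known outputs at positions less than i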

3.2 Attention

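An attention function maps a query and a set of key-value pairs to an output
The output is a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key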

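💡 Additional notes on attention [3]

In an RNN-based seq2seq model, the encoder takes the input data and compresses it into a context vector, and the decoder takes that context vector and produces the output vector: compressing the information keeps the amount of computation small
But the context vector comes only from the last RNN cell of the encoder, so the earlier RNN cells are poorly reflected
→ Attention was introduced to address this information loss: the hidden states of all encoder RNN cells are used, whereas on the decoder side only the hidden state of the current RNN cell is used

In self-attention, Query, Key, and Value all start from the same input (hence "self"); they end up different only because each is multiplied by its own weight matrix W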

3.2.1 Scaled Dot-Product Attention

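(1) additive attention: computes the compatibility function using a feed-forward network with a single hidden layer
(2) dot-product attention: much faster and more space-efficient in practice, since it can use optimized matrix multiplication
$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_{k}}})V$
$d_{k}$ : dimension of the queries and keys
- For large $d_{k}$ the dot products grow large in magnitude and push the softmax into regions with extremely small gradients; to counteract this, the dot products are scaled by $\frac{1}{\sqrt{d_{k}}}$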
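
Scaled dot-product attention, $softmax(QK^T / \sqrt{d_{k}})V$, is short enough to write out directly. The following is a minimal NumPy sketch for a single sequence without batching or masking; the function names and the random inputs are my own choices for illustration, not something taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) compatibility scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # output: (n_q, d_v)

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))
K = rng.normal(size=(5, 64))
V = rng.normal(size=(5, 64))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.sum(axis=-1))

# Why scale: for unit-variance q and k, the dot product q.k has
# variance of about d_k, so its spread grows like sqrt(d_k).
for d in (4, 64, 512):
    q = rng.normal(size=(10000, d))
    k = rng.normal(size=(10000, d))
    print(d, (q * k).sum(axis=-1).std())
```

The last loop shows the spread of the unscaled dot products growing with the dimension, which is the effect that dividing by $\sqrt{d_{k}}$ cancels.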

3.2.2 Multi-Head Attention

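Instead of performing a single attention function with $d_{model}$-dimensional keys, values, and queries, the queries, keys, and values are linearly projected $h$ times with different learned projections, and attention is performed on each projected version in parallel
→ Each head yields a $d_{v}$-dimensional output; these are concatenated and projected once more to produce the final values

The paper uses $h = 8$ parallel attention layers (= heads) with $d_{k} = d_{v} = d_{model} / h = 64$; because each head has reduced dimension, the total computational cost is similar to that of single-head attention with full dimensionality (note to self: I still do not fully follow the part where the numbers are plugged in)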
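
Plugging in the numbers makes the "same total cost" claim easier to see. The sketch below only counts things; it is rough bookkeeping under my own assumptions (biases ignored, cost measured as multiply-adds of the $QK^T$ product for a sequence of length `n`), not a profile of a real implementation.

```python
d_model, h = 512, 8
d_k = d_v = d_model // h   # 64

def projection_params(num_heads, d_k, d_v, d_model):
    # Per head: W_Q and W_K are (d_model x d_k), W_V is (d_model x d_v).
    # Plus one output projection W_O of shape (num_heads * d_v x d_model).
    per_head = 2 * d_model * d_k + d_model * d_v
    return num_heads * per_head + num_heads * d_v * d_model

def score_multiply_adds(num_heads, d_k, n):
    # Q K^T costs n * n * d_k multiply-adds per head.
    return num_heads * n * n * d_k

multi = projection_params(h, d_k, d_v, d_model)
single = projection_params(1, d_model, d_model, d_model)
print(d_k, multi, single)

n = 50
multi_scores = score_multiply_adds(h, d_k, n)
single_scores = score_multiply_adds(1, d_model, n)
print(multi_scores, single_scores)
```

Eight heads of size 64 and one head of size 512 come out with the same parameter count and the same score cost, because $h \cdot d_{k} = d_{model}$ in both cases.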

3.2.3 Applications of Attention in our Model

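The Transformer uses multi-head attention in three different ways

Encoder-decoder attention: the queries come from the previous decoder layer, and the keys and values come from the output of the encoder
→ Every position in the decoder can attend over all positions in the input sequence = the typical encoder-decoder attention mechanism in seq2seq models
The encoder contains self-attention layers: the keys, values, and queries all come from the output of the previous encoder layer, and each position in the encoder can attend to all positions in the previous layer
The decoder likewise contains self-attention layers, which allow each position to attend to all positions in the decoder up to and including that position
However, leftward information flow must be prevented (the decoder may only use the words predicted so far and must not see words that have not been predicted yet)
→ So the softmax inputs that correspond to illegal connections are masked out by setting them to -inf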
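
The -inf masking in the decoder self-attention can be sketched as follows. This is a minimal NumPy illustration with made-up uniform scores; `causal_mask` and `masked_softmax` are names I chose for the sketch.

```python
import numpy as np

def causal_mask(n):
    # True where attention is illegal: position i looking at j > i.
    return np.triu(np.ones((n, n), dtype=bool), k=1)

def masked_softmax(scores, mask):
    # Illegal connections get -inf, so exp() turns them into exactly 0.
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

n = 4
scores = np.zeros((n, n))   # uniform scores, just for illustration
weights = masked_softmax(scores, causal_mask(n))
print(weights)
```

With uniform scores, row i spreads its weight evenly over positions 0 to i and assigns zero weight to every later position, so each prediction depends only on what has already been generated.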

3.3 Position-wise Feed-Forward Networks

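Each layer in the encoder and decoder also contains a fully connected feed-forward network, applied to each position separately and identically; it consists of two linear transformations with a ReLU activation in between
The linear transformations are the same across positions but use different parameters from layer to layer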
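
In the paper this sub-layer is $FFN(x) = max(0, xW_1 + b_1)W_2 + b_2$ with $d_{model} = 512$ and inner dimension $d_{ff} = 2048$. A minimal NumPy sketch, assuming random weights in place of the learned parameters of one layer:

```python
import numpy as np

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)

# Random weights stand in for the learned parameters of ONE layer.
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

def ffn(x):
    # Two linear transformations with a ReLU in between.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(10, d_model))   # 10 positions
out = ffn(x)

# "Position-wise": feeding one position alone gives the same result
# as that position's row when the whole sequence is processed.
alone = ffn(x[3:4])[0]
print(out.shape, np.allclose(alone, out[3]))
```

Because the same `W1` and `W2` are applied to every row, no information moves between positions here; mixing across positions happens only in the attention sub-layers.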

3.4 Embeddings and Softmax

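As in other sequence transduction models, learned embeddings convert the input and output tokens to vectors of dimension $d_{model}$, and a learned linear transformation followed by a softmax converts the decoder output into next-token probabilities
Similar to [5], the same weight matrix is shared between the two embedding layers and the pre-softmax linear transformation
→ In the embedding layers, those weights are multiplied by $\sqrt{d_{model}}$

→ (Sequential Operations) A self-attention layer needs a constant number of sequential operations, whereas a recurrent layer needs $O(n)$

→ (Complexity per Layer) A self-attention layer is faster than a recurrent layer when the sequence length n is smaller than the representation dimension d (which is usually the case for the sentence representations used by SOTA machine translation models, such as word-piece and byte-pair)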
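
The per-layer complexities behind this comparison are $O(n^2 \cdot d)$ for self-attention and $O(n \cdot d^2)$ for a recurrent layer. A small sketch that plugs in numbers, ignoring constant factors (the sequence lengths are arbitrary examples I picked):

```python
def self_attention_cost(n, d):
    # O(n^2 * d): every position attends to every other position.
    return n * n * d

def recurrent_cost(n, d):
    # O(n * d^2): one d x d matrix product per position.
    return n * d * d

d = 512
for n in (50, 512, 2000):
    sa, rec = self_attention_cost(n, d), recurrent_cost(n, d)
    verdict = "self-attention cheaper" if sa < rec else "recurrent cheaper or equal"
    print(n, sa, rec, verdict)
```

The crossover sits exactly at n = d, which matches the condition stated above: typical sentence lengths are well below 512, so self-attention wins for translation-sized inputs.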

3.5 Positional Encoding

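Because the model uses no recurrence or convolution, information about the relative or absolute position of the tokens must be injected for the model to make use of the order of the sequence = positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks

$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$, $PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$
$pos$ : position, $i$ : dimension
For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, so the authors hypothesized that the model could easily learn to attend by relative position
They also experimented with learned positional embeddings, and the results were nearly identical

⇒ The sinusoidal positional encoding was chosen because it may allow the model to extrapolate to sequence lengths longer than those seen during training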
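
The sinusoidal encoding uses $\sin(pos / 10000^{2i/d_{model}})$ on even dimensions and the matching cosine on odd dimensions. The sketch below builds the table and then checks the "linear function of $PE_{pos}$" claim numerically; the values of `pos` and `k` are arbitrary choices for the check.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)   # odd dimensions 2i + 1
    return pe

d_model = 512
pe = positional_encoding(max_len=100, d_model=d_model)

# Each (sin, cos) pair at pos + k is a rotation of the pair at pos
# by the angle k * w_i, and that rotation does not depend on pos.
pos, k = 20, 7
w = 1.0 / np.power(10000.0, 2 * np.arange(d_model // 2) / d_model)
s, c = pe[pos, 0::2], pe[pos, 1::2]
sin_shift = s * np.cos(k * w) + c * np.sin(k * w)
cos_shift = c * np.cos(k * w) - s * np.sin(k * w)

ok = np.allclose(sin_shift, pe[pos + k, 0::2]) and np.allclose(cos_shift, pe[pos + k, 1::2])
print(pe.shape, ok)
```

The check is just the angle-addition identities applied per frequency: shifting by `k` is a fixed linear map on each pair of dimensions, which is what makes relative offsets easy to express.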

4. Why Self-Attention

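Motivation for using self-attention

Total computational complexity per layer
Amount of computation that can be parallelized: measured by the minimum number of sequential operations required
Path length between long-range dependencies
- The length of the paths that forward and backward signals have to traverse affects how well such dependencies can be learned
- The shorter the path, the easier it is to learn long-range dependencies

To improve computational performance on very long sequences, self-attention could be restricted to a neighborhood of size $r$ in the input sequence centered around the respective output position; this increases the maximum path length to $O(n/r)$
A single convolutional layer with kernel width $k < n$ does not connect all pairs of input and output positions
Doing so requires a stack of $O(n/k)$ convolutional layers with contiguous kernels, or $O(\log_{k}(n))$ with dilated convolutions
→ Convolutional layers are generally more expensive than recurrent layers by a factor of $k$, but separable convolutions decrease the complexity considerably
→ Even with $k = n$, the complexity of a separable convolution is equal to that of a self-attention layer combined with a point-wise feed-forward layer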

5. Training

5.1 Training Data and Batching

Training data
- WMT 2014 English-German dataset: 4.5 million sentence pairs, encoded with byte-pair encoding using a shared source-target vocabulary of about 37,000 tokens
- WMT 2014 English-French dataset: 36M sentences, split into tokens with a 32,000 word-piece vocabulary
Each training batch contained a set of sentence pairs with approximately 25,000 source tokens and 25,000 target tokens

5.2 Hardware and Schedule

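The models were trained on one machine with 8 NVIDIA P100 GPUs
For the base model each training step took about 0.4 seconds, and it was trained for a total of 100,000 steps (12 hours)
The big model was trained for 300,000 steps (3.5 days)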

5.3 Optimizer

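The Adam optimizer [6] is used, and the learning rate is varied over the course of training
The learning rate increases linearly for the first warmup_steps training steps and then decreases proportionally to the inverse square root of the step number (warmup_steps = 4000)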
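
The schedule in the paper is $lrate = d_{model}^{-0.5} \cdot min(step^{-0.5}, step \cdot warmup\_steps^{-1.5})$. A minimal sketch in plain Python, assuming steps are counted from 1 and using the base-model value $d_{model} = 512$ as the default:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # Linear warmup for the first warmup_steps steps,
    # then decay proportional to 1 / sqrt(step).
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 2000, 4000, 16000, 100000):
    print(step, transformer_lr(step))

peak = transformer_lr(4000)
```

The two branches of the `min` meet at step = warmup_steps, so the peak learning rate is reached at step 4000; halfway through warmup the rate is half the peak, and at four times the warmup length it has decayed back to half the peak.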

5.4 Regularization

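Three types of regularization are used during training

[Table 2] The Transformer achieves better BLEU scores than previous SOTA models (Transformer (big) scores 28.4 on English-to-German, more than 2 BLEU above the previous best)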

Residual Dropout

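Dropout is applied to the output of each sub-layer before it is added to the sub-layer input and normalized
Dropout is also applied to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks (the base model uses $P_{drop} = 0.1$)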

Label Smoothing

Label smoothing hurts perplexity [8], but improves accuracy and BLEU score
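
A toy example shows why label smoothing pushes the model toward less confident predictions. This is a sketch under assumptions: the paper uses a smoothing value of 0.1, but the exact variant below (mass `eps` spread uniformly over the whole vocabulary) and the 5-token vocabulary are my own choices for illustration.

```python
import math

def smooth_labels(true_index, vocab_size, eps=0.1):
    # Assumed variant: (1 - eps) on the true token plus eps
    # spread uniformly over the whole vocabulary.
    target = [eps / vocab_size] * vocab_size
    target[true_index] += 1.0 - eps
    return target

def cross_entropy(target, probs):
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

vocab_size = 5
target = smooth_labels(true_index=2, vocab_size=vocab_size)
print(target)

confident = [0.001, 0.001, 0.996, 0.001, 0.001]   # very sure of token 2
hedged = list(target)                             # matches the smoothed target

loss_confident = cross_entropy(target, confident)
loss_hedged = cross_entropy(target, hedged)
print(loss_confident, loss_hedged)
```

Under the smoothed target, the hedged prediction gets the lower loss even though it assigns less probability to the correct token. That lower probability on the true token is exactly what raises perplexity, while the argmax (and so accuracy) is unaffected.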

6. Results

6.1 Machine Translation

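Even the base model surpasses all previously published models and ensembles
English-to-German: 28.4 BLEU, more than 2 BLEU above the previous best
English-to-French: 41.0 BLEU at less than 1/4 the training cost of the previous SOTA model; this big model used $P_{drop} = 0.1$ instead of 0.3 (to do: find out why 0.1)
For the base model, the last 5 checkpoints, written at 10-minute intervals, were averaged; for the big model, the last 20 checkpoints
Beam search was used: beam size = 4 (4 candidate sequences), alpha (length penalty) = 0.6 (a length penalty so that outputs of an appropriate length are preferred)
Maximum output length: set to input length + 50, but generation terminates early when possible
To estimate FLOPs, the training time, the number of GPUs used, and an estimate of the sustained single-precision floating-point capacity of each GPU are multiplied together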
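
The FLOPs estimate is a plain product, so it can be reproduced from the numbers in section 5.2. The 9.5 TFLOPS value is the sustained single-precision figure for the P100 given in the paper's footnote; the function name is my own.

```python
P100_TFLOPS = 9.5   # sustained single-precision estimate from the paper's footnote

def training_flops(hours, num_gpus, tflops_per_gpu):
    # training time (s) x number of GPUs x FLOPs per second per GPU
    seconds = hours * 3600
    return seconds * num_gpus * tflops_per_gpu * 1e12

base = training_flops(hours=12, num_gpus=8, tflops_per_gpu=P100_TFLOPS)
big = training_flops(hours=3.5 * 24, num_gpus=8, tflops_per_gpu=P100_TFLOPS)
print(f"base: {base:.1e}  big: {big:.1e}")
```

This gives roughly 3.3e18 FLOPs for the base model and 2.3e19 for the big model, which is in line with the training-cost column of Table 2.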

6.2 Model Variations

[Table 3] Variations on the Transformer architecture

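(A) : Varies the number of attention heads and the key and value dimensions while keeping the amount of computation constant; single-head attention is 0.9 BLEU worse than the best setting (probably the h=1 row; to do: double-check what the original English sentence means)

(B) : Reducing the attention key size $d_{k}$ hurts model quality → determining compatibility is not easy

(C), (D) : Bigger models are better, and dropout is very helpful in avoiding overfitting

(E) : Replaces the sinusoidal positional encoding with learned positional embeddings; results are nearly identical to the base model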

6.3 English Constituency Parsing

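To evaluate whether the Transformer can generalize to other tasks, the authors ran experiments on English constituency parsing (analyzing a sentence as a tree to identify its grammatical constituents)
A 4-layer Transformer with $d_{model}=1024$ was trained on the WSJ (Wall Street Journal) portion of the Penn Treebank (about 40,000 training sentences)
+ It was also trained in a semi-supervised setting (using the larger high-confidence and BerkeleyParser corpora, with approximately 17 million sentences)
+ A vocabulary of 16K tokens was used for the WSJ-only setting and 32K tokens for the semi-supervised setting
Only a small number of experiments were run to select dropout (both attention and residual), learning rates, and beam size; all other parameters were unchanged from the English-to-German base translation model
During inference, for both the WSJ-only and the semi-supervised setting, maximum output length = input length + 300, beam size = 21, alpha = 0.3

→ Despite the lack of task-specific tuning, the model performs well, better than all previously reported models except the Recurrent Neural Network Grammar ([8] in the paper)

→ In contrast to RNN sequence-to-sequence models ([37] in the paper), the Transformer outperforms the BerkeleyParser ([29] in the paper) even when trained only on the WSJ training set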

7. Conclusion

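The Transformer is the first sequence transduction model based entirely on attention, replacing the recurrent layers most commonly used in encoder-decoder architectures with multi-headed self-attention
On translation tasks, the Transformer can be trained significantly faster than architectures based on recurrent or convolutional layers
Future plans: extend the model to modalities other than text, and investigate local, restricted attention mechanisms to handle large inputs and outputs efficiently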

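Miscellaneous

SOTA: short for 'State of the Art'; in artificial intelligence (AI) and machine learning (ML), the best model or algorithm currently available for a given task [4]
Dropout [7]
A method that randomly drops nodes with a given probability so that they do not take part in training
Perplexity: an evaluation metric for language models (also called PPL; can be read as 'how confused the model is')
A lower PPL means better language model performance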
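
Perplexity is the exponential of the average negative log-likelihood that the model assigns to the reference tokens. A minimal sketch in plain Python, where the probability lists are made-up values for illustration:

```python
import math

def perplexity(token_probs):
    # exp of the average negative log-likelihood of the reference tokens
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

uniform_guess = perplexity([0.25, 0.25, 0.25, 0.25])
confident = perplexity([0.9, 0.8, 0.95, 0.85])
print(uniform_guess, confident)
```

A model that always spreads its probability evenly over 4 choices has a perplexity of about 4, which is why PPL can be read as the number of options the model is effectively choosing between; the more confident model scores much closer to 1.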

FLOPs (Floating Point Operations) [9]
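- The number of floating-point operations (!= FLOPS: floating-point operations per second, a measure of hardware performance)
- Mainly used to measure the computational complexity of a model
- The value depends on the input size, and numbers obtained naively from a Python library can be wrong, so it is worth knowing the formulas and checking by hand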

References

[1] Factorization tricks for LSTM networks. arXiv:1703.10722

[2] Deep residual learning for image recognition

[3] Additional notes on attention: https://codingopera.tistory.com/41

[4] Meaning of SOTA: https://wikidocs.net/238152

[5] Using the Output Embedding to Improve Language Models. arXiv:1608.05859

[6] Adam optimizer: Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization

[7] Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research

[8] Additional notes on perplexity: https://wikidocs.net/21697

[9] Additional notes on FLOPs
http://kimbg.tistory.com/26

https://davidlds.tistory.com/35
