16. Transformers

서히! 2025. 1. 23. 14:47

Outline

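Attention built on top of seq2seq has limitations
1. It cannot be parallelized → slow
2. It does not fully solve the long-term dependency problem
  ⇒ Attention alleviates the long-term dependency problem to some extent, but the model is still RNN-based, so it cannot be parallelized
Characteristics of the Transformer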
- Positional Encoding
- (Masked) Multi-Head Attention
- Scaled Dot-product Attention
- Position-wise Feedforward Network + Residual Connection

Transformer

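Encoder Input = Source Sentence
Decoder Input = right-shifted Target Sentence
- The decoder also receives an input; the token sequence is extended (typically by prepending a start token)
- The context vector of the sentence before translation is injected (a context vector is a vector of numbers that compresses information into a form the computer can process)
- Given the subsequence of preceding words, the decoder predicts the next word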
Decoder Output = Target Sentence
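
The right shift can be illustrated with a small sketch. The special tokens `<sos>` and `<eos>` and the helper name `make_decoder_pair` are assumptions for illustration; the notes do not name them.

```python
# Hypothetical special tokens; the notes above do not specify them.
SOS, EOS = "<sos>", "<eos>"

def make_decoder_pair(target_tokens):
    """Return (decoder_input, decoder_output) for one target sentence."""
    decoder_input = [SOS] + target_tokens    # target shifted right by one position
    decoder_output = target_tokens + [EOS]   # what the decoder should predict
    return decoder_input, decoder_output

dec_in, dec_out = make_decoder_pair(["I", "am", "a", "student"])
for step in range(len(dec_in)):
    # At each step the decoder sees the prefix and must predict the next token.
    print(dec_in[: step + 1], "->", dec_out[step])
```

Each printed line pairs a prefix of the decoder input with the token the decoder is expected to produce at that step.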

Positional Encoding

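CNNs and RNNs can capture order (time series), but the Transformer by itself has no notion of order
Property: because a shift in position can be expressed as a linear transformation, each attention head can pick up semantic relations that depend on relative position (the positional information survives through the layers because residual connections are used)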
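
A minimal sketch of the linear-transformation property, assuming the sinusoidal encoding from the original Transformer paper (the notes do not say which encoding is meant). The function names and the base 10000 are illustrative choices; `shift_encoding` applies a fixed 2x2 rotation to each sin/cos pair.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding of one position (d_model assumed even)."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def shift_encoding(pe, k, d_model):
    """Move an encoding k positions ahead using only a linear map."""
    out = []
    for i in range(0, d_model, 2):
        w = k / (10000 ** (i / d_model))
        s, c = pe[i], pe[i + 1]
        # Angle-addition identities, written as a rotation of (sin, cos).
        out.extend([s * math.cos(w) + c * math.sin(w),
                    c * math.cos(w) - s * math.sin(w)])
    return out

d = 8
direct = positional_encoding(5, d)
shifted = shift_encoding(positional_encoding(2, d), 3, d)
# The two encodings should agree up to floating-point error.
print(max(abs(a - b) for a, b in zip(direct, shifted)))
```

The rotation depends only on the offset k, not on the starting position, which is what lets a head relate tokens by relative distance.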

CNN

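Similar to N-grams
Words enter in groups such as (Hello, my), (my, name) ..., so the model can see which words appear near each other, and this captures order
The difference is that the filters are learned automatically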
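
The grouping above can be reproduced with a sliding window. This is only a sketch of the N-gram side of the analogy; the example sentence and the helper name `ngrams` are assumptions, and a real CNN would apply learned filters to each window rather than just listing it.

```python
def ngrams(tokens, n=2):
    """Slide a window of width n over the tokens, like a 1-D filter of width n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(["Hello", "my", "name", "is"]))
# [('Hello', 'my'), ('my', 'name'), ('name', 'is')]
```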

RNN

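A recurrent unit stores state information, and the output is determined by the input and the state
With an input and an output at each time step t, it is well suited to sequence data
The output of the previous time step influences the output of the current time step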
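
A minimal sketch of this recurrence, using a scalar state and made-up weights (`w_x`, `w_h`, `b` are illustrative values, not taken from the lecture). The loop also shows why the computation is sequential: step t cannot start before step t-1 has finished.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit RNN over a sequence and return the state at every step."""
    h = 0.0                      # initial state
    outputs = []
    for x_t in inputs:           # must be processed in order
        h = math.tanh(w_x * x_t + w_h * h + b)
        outputs.append(h)
    return outputs

# The only difference is the first input, yet later outputs differ too.
print(rnn_forward([1.0, 0.0, 0.0]))
print(rnn_forward([0.0, 0.0, 0.0]))
```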

Multi-Head Attention

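Scaled Dot-Product Attention is performed multiple times on the Query, Key, and Value
- Each head attends from a different perspective, so more information is captured than with attention from a single perspective
- The computation is split into several heads and processed in parallel
Finally, the attention information from all perspectives is concatenated (concat) and passed on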
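
A simplified sketch of the split, attend, concat pattern. As an assumption made to keep it short, the heads here are plain slices of the feature dimension; an actual Transformer applies learned linear projections per head and another one after the concat, which this sketch omits. All function names are illustrative.

```python
import math

def attention(Q, K, V):
    """Scaled dot-product attention on lists of vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum((e / total) * v[c] for e, v in zip(exps, V))
                    for c in range(len(V[0]))])
    return out

def split_heads(X, num_heads):
    """Cut each vector into num_heads equal slices (one list of rows per head)."""
    d_head = len(X[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in X]
            for h in range(num_heads)]

def multi_head_attention(Q, K, V, num_heads):
    heads = [attention(q, k, v)
             for q, k, v in zip(split_heads(Q, num_heads),
                                split_heads(K, num_heads),
                                split_heads(V, num_heads))]
    # Concatenate the head outputs token by token.
    return [sum((head[t] for head in heads), []) for t in range(len(Q))]

X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
out = multi_head_attention(X, X, X, num_heads=2)
print(len(out), len(out[0]))
```

Because each head works on its own slice, the per-head calls are independent of each other and could run in parallel.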

Masked Multi-Head Attention

Why masking is used

The decoder must not attend to positions it has not predicted yet, so masking restricts attention to the preceding words only
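
A minimal sketch of the mask, assuming the common approach of setting future scores to negative infinity before the softmax so that their weights become zero. The function name `causal_softmax` is illustrative.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Equal raw scores, so each row spreads its weight over the visible positions.
for row in causal_softmax([[0.0, 0.0, 0.0]] * 3):
    print(row)
```

The first row can only see itself, the second row sees two positions, and the last row sees all three.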

Scaled Dot-Product Attention

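Query (Q), Key (K), Value (V) attention → the values are combined in a weighted sum, with attention weights given by the similarity between the query and the keys

For each Q vector, an attention score is computed against every K vector
The scores are passed through a softmax function to turn them into a probability distribution (= the attention distribution)
With the attention distribution, the V vectors are combined in a weighted sum to obtain the context vector
This is repeated for every Q vector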
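
The steps above can be sketched in plain Python. The division by the square root of the key dimension is the "Scaled" part of the name; the function names and toy vectors are assumptions for illustration.

```python
import math

def softmax(scores):
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (context vectors, attention weights) for lists of vectors."""
    d_k = len(K[0])
    contexts, weights = [], []
    for q in Q:                                  # repeat for every Q vector
        # 1. attention score of q against every K vector, scaled by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # 2. softmax turns the scores into the attention distribution
        w = softmax(scores)
        # 3. weighted sum of the V vectors gives the context vector
        contexts.append([sum(w_j * v[c] for w_j, v in zip(w, V))
                         for c in range(len(V[0]))])
        weights.append(w)
    return contexts, weights

ctx, attn = scaled_dot_product_attention(
    Q=[[1.0, 0.0]],
    K=[[1.0, 0.0], [0.0, 1.0]],
    V=[[10.0, 0.0], [0.0, 10.0]],
)
print(attn[0], ctx[0])
```

The query matches the first key more closely, so the first value dominates the resulting context vector.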

Self Attention

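Query, Key, and Value all come from the same input (Q = K = V = T (Token Embedding Vector), before any learned linear projections)
The elements of T attend to one another (Attention: a mechanism that decides which words to focus on depending on context, in order to improve model performance)
- Measures the similarity between the tokens in the sentence
- That is, each token's representation vector is updated in proportion to that similarity → context vector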
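
A small sketch of self-attention where the same matrix T plays all three roles. The tokens and their 2-dimensional embeddings are made up for illustration, and the learned projections of a real Transformer layer are left out.

```python
import math

def self_attention(T):
    """Scaled dot-product attention with Q = K = V = T."""
    d = len(T[0])
    weights, contexts = [], []
    for q in T:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in T]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        # Each token's new vector is a similarity-weighted mix of all tokens.
        contexts.append([sum(w_j * t[c] for w_j, t in zip(w, T))
                         for c in range(d)])
    return weights, contexts

# Made-up embeddings, for illustration only.
tokens = ["cat", "kitten", "car"]
T = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
weights, contexts = self_attention(T)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

With these toy vectors, "cat" puts more weight on "kitten" than on "car", because their embeddings point in similar directions.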

Attention between Encoder and Decoder

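Predicts the next word while taking the context into account
Similarity is computed between the decoder's queries and the encoder's keys → the context-aware vectors keep changing as they move up through the layers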

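Miscellaneous

seq2seq
- A deep learning model, made up of an encoder and a decoder, that converts an input sequence into an output sequence of a different form
- Encoder: produces the context vector
- Decoder: generates the output sequence based on the context vector produced by the encoder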
