[Paper Review] Unveiling Super Experts in Mixture-of-Experts Large Language Models

서히! 2025. 8. 17. 22:18

Unveiling Super Experts in Mixture-of-Experts Large Language Models

Sparsely activated Mixture-of-Experts (MoE) models have shown promise in enhancing the learning capacity of large language models (LLMs). Leveraging the intrinsic importance differences among experts, recent research has explored expert-level compression t

arxiv.org

Abstract

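Sparse MoE LLMs are more compute-efficient than dense LLMs
Problem: the large number of experts makes the total parameter count very large → hard to deploy in real services
Prior expert-level compression work exists (expert pruning, expert quantization, expert merging, etc.)
However, that work relies on empirical criteria such as router frequency and gate scores
In-depth analysis of the mechanism behind heterogeneous importance across experts is lacking
This paper discovers a new subset of experts → named Super Experts (SEs)
Key characteristics
- Rare activation outliers arise at the down-proj stage
- They propagate to the hidden states through the residual connection → form massive activations
- These activations are directly tied to the model's attention sink mechanism
Experimental results
- Removing SEs → PPL spikes and reasoning performance collapses completely
- Removing just 3 SEs makes Qwen3-30B-A3B generate meaningless repetitive output
Contributions
1. First discovery of SEs in MoE LLMs, plus an automatic profiling tool
2. Shows that removing SEs is fatal to both performance and the attention sink mechanism
3. SEs must be preserved when compressing MoE models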

1. Introduction

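MoE LLMs use dynamic routing + sparse activation, giving them higher learning capacity than dense models
Representative models: the Qwen series, the DeepSeek series, Mixtral, LLaMA-4 MoE, etc.
Advantage: only a subset of experts is activated for each input, which improves compute efficiency
Disadvantage: the sheer number of experts blows up the parameter count → high inference cost and difficult deployment
Prior work: expert-level compression techniques
- Expert pruning (frequency-based, importance-based)
- Expert merging (merging similar experts)
- Expert skipping (ignoring rarely activated experts)
- Expert quantization (allocating more bits to more important experts)
Limitations
- These are merely empirical criteria → no mechanistic analysis of whether a specific set of experts in the model is fundamentally indispensable
Core question
- Is there a set of experts in an MoE LLM that the model truly cannot do without?
Findings of this paper
- Yes → these are named Super Experts (SEs)
- SEs are very few, but removing them collapses performance
- Example: removing 3 SEs from Qwen3-30B-A3B makes PPL jump from 8.7 to 59.8
- Model output degenerates into repeated phrases such as “the way it’s, the way it’s …”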

2. Preliminaries on MoE LLMs

MoE LLM = Transformer decoder-based architecture
Each decoder block = MHSA + MoE layer
MoE layer: the router selects the Top-k experts among several experts (FFNs)
The router assigns a weight to each expert using a softmax

Decoder block equation

Router equation

MoE layer output equation

FFN equation
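
The router, MoE-layer output, and FFN steps named above can be sketched in NumPy. This is a minimal illustration of a generic Top-k MoE layer, not the paper's exact formulation: the gated SiLU FFN, the renormalisation of the Top-k weights, and all function and parameter names (`expert_ffn`, `moe_layer`, `w_router`) are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def silu(x):
    return x / (1.0 + np.exp(-x))

def expert_ffn(x, w_gate, w_up, w_down):
    # Assumed gated FFN: down_proj(silu(gate_proj(x)) * up_proj(x)).
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def moe_layer(x, w_router, experts, top_k):
    """x: (tokens, d_model); w_router: (d_model, n_experts);
    experts: list of (w_gate, w_up, w_down) tuples."""
    probs = softmax(x @ w_router)                     # router weight per expert
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]  # Top-k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = top_idx[t]
        weights = probs[t, chosen]
        weights = weights / weights.sum()  # assumption: renormalise over Top-k
        for w, e in zip(weights, chosen):
            out[t] += w * expert_ffn(x[t], *experts[e])
    return out, top_idx
```

In a decoder block this output would be added back to the residual stream, which is why a single extreme value produced by one expert's down-proj can persist into later layers.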

3. Super Experts: Discovery and Localization

3.1 Discovery of SEs

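Prior work on dense LLMs reported an outlier phenomenon called “massive activations” (values more than 10^5 times larger than other activations)
This paper finds the same phenomenon in MoE models
Cause: it arises only in specific experts, not in all experts
Down-proj produces extreme outliers → they are passed to the next layer's hidden states through the residual connection → they spread to all subsequent layers
Ablation result: removing SEs → massive activations disappear completely
Therefore SEs are the origin of the massive activations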
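
The propagation argument above can be illustrated with a toy residual stream. This is only a sketch under strong assumptions: each layer's update is random noise, one layer adds a single large value to one channel (standing in for an SE's down-proj output), and the function name and all numbers are hypothetical rather than measurements from the paper.

```python
import numpy as np

def run_residual_stream(n_layers, d_model, outlier_layer, outlier_channel,
                        outlier_value, seed=0, prune=False):
    """Return max |hidden state| after each layer of a toy residual stream."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=d_model)
    max_abs = []
    for layer in range(n_layers):
        update = rng.normal(scale=0.1, size=d_model)  # ordinary layer output
        if layer == outlier_layer and not prune:
            # Stand-in for the extreme down-proj output of a Super Expert.
            update[outlier_channel] += outlier_value
        h = h + update  # residual connection keeps whatever was added before
        max_abs.append(float(np.abs(h).max()))
    return max_abs

with_se = run_residual_stream(8, 16, outlier_layer=2, outlier_channel=5,
                              outlier_value=1000.0)
without_se = run_residual_stream(8, 16, outlier_layer=2, outlier_channel=5,
                                 outlier_value=1000.0, prune=True)
```

With the outlier present, the maximum magnitude stays large in every layer from `outlier_layer` onward; with `prune=True` it never appears, mirroring the ablation result in miniature.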

Specific SEs in Qwen3-30B-A3B trigger the massive activations

3.2 Localization of SEs

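Proposes an activation-based criterion for defining SEs

Developed an automatic profiling tool ( https://github.com/ZunhaiSu/Super-ExpertsProfilling )
Qwen, DeepSeek, and Mixtral all contain SEs
Main findings
- SEs make up ≤ 0.5% of all experts
- Comparing base and fine-tuned models → the SE distribution is identical
- The SE distribution stays stable across different datasets
Heatmap analysis: SEs are concentrated in specific layers (e.g., layers 1-3 for Qwen, layer 1 for Mixtral)

Heatmap: down-proj output of each expert, with SEs highlighted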
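
An activation-based criterion of this kind might be sketched as follows. This is not the code of the profiling tool linked above: the input is assumed to be a precomputed grid of the maximum absolute down-proj output per (layer, expert), and the rule (above a high percentile and above a fraction of the global maximum), the default thresholds, and the name `find_super_experts` are all illustrative assumptions.

```python
import numpy as np

def find_super_experts(max_abs_down_proj, percentile=99.5, frac_of_max=0.1):
    """Flag (layer, expert) pairs whose peak down-proj magnitude is an outlier.

    max_abs_down_proj: array of shape (n_layers, n_experts).
    Thresholds are hypothetical defaults, not values taken from the post.
    """
    a = np.asarray(max_abs_down_proj, dtype=float)
    cutoff = max(np.percentile(a, percentile), frac_of_max * a.max())
    layers, experts = np.where(a > cutoff)
    return [(int(l), int(e)) for l, e in zip(layers, experts)]

# Toy grid: 48 layers x 128 experts with three planted outliers
# (the indices and magnitudes are made up for the example).
rng = np.random.default_rng(0)
grid = rng.uniform(0.0, 1.0, size=(48, 128))
grid[1, 68], grid[2, 92], grid[3, 82] = 1000.0, 800.0, 500.0
super_experts = find_super_experts(grid)
```

In this toy grid the three flagged experts are 3 of 6144, well under the ≤ 0.5% ratio noted above.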

4. The Importance of Super Experts

4.1 Non-Reasoning Models

Models evaluated: Qwen3-30B-A3B (non-thinking), DeepSeek-V2-Lite, Mixtral-8x7B
Evaluation datasets: ARC-c, ARC-e, BoolQ, GSM8K, HellaSwag, MMLU, OpenBookQA, PIQA, WinoGrande
Results
- SE pruning → average performance drops by 20-27%
- On GSM8K, performance plunges by 52-74%
- Random pruning has almost no effect
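
One way to simulate pruning a chosen set of experts for such a comparison is to mask them out of the router so they can never be selected. The post does not say how the paper implements pruning, so this masking approach, the function name `route_with_pruning`, and the Top-k weight renormalisation are assumptions for illustration only.

```python
import numpy as np

def route_with_pruning(router_logits, top_k, pruned_experts=()):
    """Select Top-k experts per token while excluding pruned experts.

    router_logits: (tokens, n_experts). Assumes top_k does not exceed the
    number of experts that remain after pruning.
    """
    logits = np.array(router_logits, dtype=float)
    pruned = np.asarray(list(pruned_experts), dtype=int)
    logits[:, pruned] = -np.inf  # pruned experts can never win the Top-k
    top_idx = np.argsort(-logits, axis=-1)[:, :top_k]
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    # Softmax over the selected experts only.
    weights = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return top_idx, weights
```

Passing the SE indices as `pruned_experts` gives the SE-pruned setting, and passing the same number of randomly drawn indices gives a random-pruning baseline.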

4.2 Reasoning Models

Models evaluated: DeepSeek-R1, Qwen3-30B-A3B (thinking mode)
Benchmarks: GPQA, Math-500, AIME 2024/25, HumanEval, LiveCodeBench
Results
- Removing SEs → Pass@1 drops to nearly 0%
- On math problems such as Math-500, the model falls into endlessly repeating output
Example: it only outputs repetitions like “the way it’s, the way it’s …”

Table 4: Evaluation of the importance of SEs in DeepSeek-R1

Table 5: Evaluation of the importance of SEs in Qwen3-30B-A3B

5. Understanding the Impact of SE Compression

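Prior work on dense LLMs: massive activations → formation of attention sinks
Attention sink: a phenomenon where semantically uninformative tokens receive a disproportionate share of attention
This paper: confirms that the attention sink collapses when SEs are removed
Proposes the Attention Sink Decay Rate

Result: after SE removal the decay rate exceeds 90% → the sink collapses completely
Once the attention sink disappears, the attention distribution breaks down and global information can no longer be propagated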
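
To make the idea of a decay rate concrete, here is a small sketch. The post does not reproduce the paper's formula, so the definition used here is an assumption: the sink strength is taken as the mean attention probability placed on the first token, and the decay rate is its relative drop after pruning. The names `sink_mass` and `sink_decay_rate` are hypothetical.

```python
import numpy as np

def sink_mass(attn, sink_pos=0):
    """Mean attention probability placed on the assumed sink token.

    attn: (heads, queries, keys) array of attention probabilities.
    """
    return float(np.asarray(attn, dtype=float)[..., sink_pos].mean())

def sink_decay_rate(attn_original, attn_pruned, sink_pos=0):
    # Assumed definition: relative loss of sink mass after pruning.
    before = sink_mass(attn_original, sink_pos)
    after = sink_mass(attn_pruned, sink_pos)
    return (before - after) / before

# Toy attention maps (2 heads, 3 queries, 4 keys); values are made up.
original = np.tile([0.8, 0.1, 0.05, 0.05], (2, 3, 1))
pruned = np.tile([0.05, 0.35, 0.3, 0.3], (2, 3, 1))
rate = sink_decay_rate(original, pruned)
```

Under this assumed definition, a rate above 0.9 means the first token has lost more than 90% of the attention it used to receive.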

6. Related Work

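Summary of prior work
- Expert merging (M-SMoE)
- Expert pruning/skipping (NAEE, MoE-Pruner, etc.)
- Expert-level quantization (MxMoE, MoEQuant, etc.)
Most rely only on empirical criteria (frequency, router score, gradient, etc.)
What sets this paper apart
- Discovers SEs → links them directly to the MoE inference mechanism
- Provides a basis for explaining why compression can lead to performance collapse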

7. Conclusion

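Discovery and systematic analysis of Super Experts (SEs)
SE characteristics
- Produce down-proj outliers
- Drive the mechanism that creates attention sinks
- Essential for maintaining reasoning performance
SE pruning → massive activations vanish + performance collapses
Conclusion: SEs must be preserved when compressing MoE models
Future work: developing SE-aware compression strategies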

서히의 우당탕탕 코딩일기

https://github.com/seohee0925
