Research/Natural Language Processing

LLM-Check: Investigating Detection of Hallucinations in Large Language Models (NeurIPS 2024)

서히! 2025. 5. 25. 23:23

Paper: Gaurang Sriramanan, Siddhant Bharti, Vinu Sankar Sadasivan, Shoumik Saha, Priyatham Kattakinda, and Soheil Feizi, “LLM-Check: Investigating Detection of Hallucinations in Large Language Models” (NeurIPS 2024)

 

Problems

  • Prior approaches such as consistency checks and retrieval-based methods often require access to multiple model responses or large external databases, making them computationally expensive and impractical for real-time applications.
  • Existing techniques rely on single-aspect indicators like uncertainty or consistency, which limit their accuracy and generalizability across diverse hallucination types.
  • Most current methods cannot detect hallucinations within a single response without extra inference-time overhead, restricting their utility in deployment scenarios.

Solution

  • LLM-Check analyzes internal LLM representations (hidden states, attention maps, and output probabilities) to detect hallucinations within a single response, in both white-box and black-box settings.
  • The method requires neither multiple generations nor external databases, achieving computational efficiency through eigenvalue analysis of covariance matrices and attention kernel maps.

Main method

  • Eigenvalue Analysis of Internal Representations:
    • Hidden Score: Computes the mean log-determinant of covariance matrices from hidden states
    • Attention Score: Leverages the lower-triangular structure of autoregressive attention maps, whose eigenvalues are exactly their diagonal entries, to compute log-determinants cheaply
  • Output Token Uncertainty: Quantifies perplexity and logit entropy, including windowed entropy to localize hallucinations within sequences
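The two eigenvalue-based scores can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: function names, the ridge term `eps`, and the choice of a token-by-token Gram matrix are my assumptions for a self-contained example.

```python
import numpy as np

def hidden_score(hidden_states: np.ndarray, eps: float = 1e-3) -> float:
    """Mean log-eigenvalue (log-determinant) of the covariance of hidden states.

    hidden_states: (seq_len, hidden_dim) activations from one transformer layer.
    eps is a small ridge term for numerical stability (my addition, not from the paper).
    """
    H = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    # Token-by-token Gram matrix keeps the matrix small: (seq_len, seq_len)
    cov = H @ H.T / H.shape[1]
    eigvals = np.linalg.eigvalsh(cov + eps * np.eye(cov.shape[0]))
    return float(np.mean(np.log(eigvals)))

def attention_score(attn_map: np.ndarray, eps: float = 1e-6) -> float:
    """Mean log-eigenvalue of a causal (lower-triangular) attention map.

    A triangular matrix's eigenvalues are exactly its diagonal entries,
    so no eigendecomposition is needed -- this is what makes the
    attention score cheap to compute.
    """
    diag = np.diagonal(attn_map)
    return float(np.mean(np.log(diag + eps)))
```

The attention score's efficiency comes entirely from the triangular structure: the log-determinant reduces to a sum of logs of the diagonal, which is linear in sequence length.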
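The output-uncertainty signals can likewise be sketched from raw logits. Again a hedged illustration: the function names, the default window size, and the use of a sliding mean over per-token entropy are my assumptions about how "windowed entropy" localizes uncertain spans.

```python
import numpy as np

def token_logprobs(logits: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    """Log-probabilities of the realized tokens under the model's logits."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # stabilized softmax
    logZ = np.log(np.exp(logits).sum(axis=-1))
    return logits[np.arange(len(token_ids)), token_ids] - logZ

def perplexity(logits: np.ndarray, token_ids: np.ndarray) -> float:
    """Exponentiated mean negative log-likelihood of the response tokens."""
    return float(np.exp(-token_logprobs(logits, token_ids).mean()))

def windowed_entropy(logits: np.ndarray, window: int = 5) -> np.ndarray:
    """Mean predictive entropy over a sliding window, to localize
    high-uncertainty (potentially hallucinated) spans in the sequence."""
    p = np.exp(logits - logits.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    ent = -(p * np.log(p + 1e-12)).sum(axis=-1)  # per-token entropy
    kernel = np.ones(window) / window
    return np.convolve(ent, kernel, mode="valid")
```

A spike in the windowed entropy curve flags a local span of tokens the model was unsure about, which is what lets the method point at *where* in a response a hallucination may sit rather than scoring the response as a whole.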

Pros / Cons
(Pros)

  • Achieves 45–450× speedups over baselines while maintaining high detection accuracy.
  • Enables single-response detection without external databases.
  • Combines multiple detection modalities (hidden states, attention, probabilities) for robustness.

(Cons)

  • Performance varies significantly across transformer layers, requiring careful layer selection.
  • Lacks theoretical justification for the correlation between attention eigenvalues and hallucinations.
  • Shows sensitivity to hallucination type (e.g., performs better on invented hallucinations than on subjective ones).