Addressing Pitfalls in the Evaluation of Uncertainty Estimation Methods for Natural Language Generation

Research output: Chapter in Book/Report/Conference proceeding › Conference proceedings › peer-review

Abstract

Hallucinations are a common issue that undermines the reliability of large language models (LLMs). Recent studies have identified a specific subset of hallucinations, known as confabulations, which arise from the predictive uncertainty of LLMs. To detect confabulations, various methods for estimating predictive uncertainty in natural language generation (NLG) have been developed. These methods are typically evaluated by correlating uncertainty estimates with the correctness of generated text, with question-answering (QA) datasets serving as the standard benchmark. However, evaluating correctness in QA tasks is inherently challenging and can distort the perceived effectiveness of uncertainty estimation methods. Our results show that there is substantial disagreement between correctness functions; consequently, the ranking of uncertainty estimation methods is significantly influenced by this choice, making it possible to inflate the performance of uncertainty estimation methods. We propose several alternative risk indicators for correlation assessment that improve the robustness of the empirical assessment of uncertainty estimation (UE) algorithms for NLG. For QA tasks, we show that averaging over multiple LLM-as-a-judge variants leads to more reliable results. Furthermore, we explore structured tasks which provide unambiguous correctness functions. Finally, we propose using an Elo rating of uncertainty estimation methods to provide an objective summary across extensive evaluation settings.
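The Elo summarization proposed in the abstract can be illustrated with a minimal sketch: treat each evaluation setting (e.g. a dataset paired with a correctness function) as a "match" between two UE methods and apply the standard Elo update. The method names, match outcomes, and the K-factor below are hypothetical illustrations, not the paper's actual configuration.

```python
# Minimal sketch of Elo-based ranking of uncertainty-estimation (UE)
# methods. Each (winner, loser) pair represents one evaluation setting
# in which one UE method outperformed another; all names are invented.

def update_elo(r_winner, r_loser, k=32.0):
    """Standard Elo update: compute the winner's expected score from
    the logistic curve, then shift both ratings by k * (1 - expected)."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)
    return r_winner + delta, r_loser - delta

def elo_ranking(matches, initial=1000.0, k=32.0):
    """matches: iterable of (winner, loser) pairs. Returns a list of
    (method, rating) tuples sorted best-first."""
    ratings = {}
    for winner, loser in matches:
        rw = ratings.get(winner, initial)
        rl = ratings.get(loser, initial)
        ratings[winner], ratings[loser] = update_elo(rw, rl, k)
    return sorted(ratings.items(), key=lambda kv: -kv[1])

# Toy outcomes: one hypothetical UE method wins most settings.
matches = [
    ("semantic_entropy", "token_logprob"),
    ("semantic_entropy", "sample_variance"),
    ("sample_variance", "token_logprob"),
    ("semantic_entropy", "token_logprob"),
]
for name, rating in elo_ranking(matches):
    print(f"{name}: {rating:.1f}")
```

Because each update adds to the winner exactly what it subtracts from the loser, the total rating mass is conserved, which makes the final ordering a compact summary of many heterogeneous evaluation settings.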
Original language: English
Title of host publication: ICLR Workshop: Quantify Uncertainty and Hallucination in Foundation Models: The Next Frontier in Reliable AI
Number of pages: 24
Edition: 1
Publication status: Published - Jul 2025

Fields of science

  • 101019 Stochastics
  • 102003 Image processing
  • 103029 Statistical physics
  • 101018 Statistics
  • 101017 Game theory
  • 102001 Artificial intelligence
  • 202017 Embedded systems
  • 101016 Optimisation
  • 101015 Operations research
  • 101014 Numerical mathematics
  • 101029 Mathematical statistics
  • 101028 Mathematical modelling
  • 101026 Time series analysis
  • 101024 Probability theory
  • 102032 Computational intelligence
  • 102004 Bioinformatics
  • 102013 Human-computer interaction
  • 101027 Dynamical systems
  • 305907 Medical statistics
  • 101004 Biomathematics
  • 305905 Medical informatics
  • 101031 Approximation theory
  • 102033 Data mining
  • 102 Computer Sciences
  • 305901 Computer-aided diagnosis and therapy
  • 102019 Machine learning
  • 106007 Biostatistics
  • 102018 Artificial neural networks
  • 106005 Bioinformatics
  • 202037 Signal processing
  • 202036 Sensor systems
  • 202035 Robotics

JKU Focus areas

  • Digital Transformation
