The Unseen Variables in LLM Confidence Calibration
Evaluating LLM confidence calibration shows how strongly measurement choices shape the result. Token-probability scores and verbalized confidence don't always align.
How do we gauge the confidence of a large language model (LLM)? The question is more complicated than it seems, because the answer hinges on how you measure it. Two signals are often compared: token-probability scores, read from the probabilities the model assigns to its answer tokens, and verbalized confidence, where the model states a probability in text. Neither is a straightforward readout of model uncertainty. Both require careful interpretation, and, crucially, the measurement choices behind them often go unstated.
The Measurement Maze
The researchers held the verbalized-confidence elicitation constant, using a single prompt template and a fixed probability scale. They then varied the axes that define the verbalized-versus-token comparison. What happens when you switch which answer string receives the token-probability score, or alter how that score is read from the answer tokens? It turns out, quite a bit. The design was tested on four QA benchmarks across three model families, using 7 to 8 billion parameter base and Instruct models, with larger Qwen2.5 variants serving as robustness checks.
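To make "how that score is read from the answer tokens" concrete, here is a minimal sketch of three commonly used readouts. The function names and the example log-probabilities are assumptions for illustration; the article does not say which readouts the study compared.

```python
import math


def _require_tokens(token_logprobs):
    # Guard shared by all readouts: an empty answer has no score.
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")


def joint_probability(token_logprobs):
    """Probability of the whole answer string (product of token probabilities)."""
    _require_tokens(token_logprobs)
    return math.exp(sum(token_logprobs))


def length_normalized_probability(token_logprobs):
    """Geometric mean of token probabilities, so longer answers are not penalized."""
    _require_tokens(token_logprobs)
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def first_token_probability(token_logprobs):
    """Probability of the first answer token only."""
    _require_tokens(token_logprobs)
    return math.exp(token_logprobs[0])


# Hypothetical log-probabilities for a three-token answer (not from the study).
answer_logprobs = [math.log(0.9), math.log(0.6), math.log(0.8)]

for readout in (joint_probability, length_normalized_probability, first_token_probability):
    print(f"{readout.__name__}: {readout(answer_logprobs):.3f}")
```

The same answer receives three different confidence scores depending on the readout, which is why the choice needs to be reported alongside any calibration number.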
The findings are telling. Changing the conditioning context can flip the sign of the Expected Calibration Error (ECE) gap, or change its magnitude, across settings. Token readout influences outcomes to a lesser extent but still causes notable shifts. Surprisingly, altering the ECE estimator had little effect. The takeaway: under the default protocol, the generated-answer, bare-context Instruct settings hover near parity rather than showing a significant calibration advantage for verbalized confidence.
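For readers unfamiliar with ECE, the sketch below shows one standard estimator (equal-width bins) and how a verbalized-versus-token gap could be computed from it. The bin count, the function names, and the toy numbers are assumptions for illustration, not the study's settings or data.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |avg confidence - accuracy| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("confidences and correct must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


def ece_gap(verbalized_conf, token_conf, correct, n_bins=10):
    """Verbalized ECE minus token ECE; negative means verbalized is better calibrated."""
    return (expected_calibration_error(verbalized_conf, correct, n_bins)
            - expected_calibration_error(token_conf, correct, n_bins))


# Toy data invented for illustration only.
correct = [1, 1, 0, 1, 0]
verbalized = [0.9, 0.8, 0.7, 0.9, 0.6]
token_based = [0.95, 0.6, 0.3, 0.85, 0.4]
print(f"ECE gap (verbalized - token): {ece_gap(verbalized, token_based, correct):+.3f}")
```

Because the gap is a difference of two estimates, changing what feeds either side (the scored answer, the readout, the context) can move it across zero, which is the sign flip described above.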
Surface Plausibility vs. Correctness
Western coverage has largely overlooked this: when supplied answers are analyzed separately, the model assigns nearly the same confidence to surface-plausible wrong answers as it does to correct ones. This suggests that verbalized confidence measures more than correctness; it also reflects the plausibility or provenance of an answer. Shouldn't this act as a wake-up call to reconsider how we interpret LLM confidence?
The paper, published in Japanese, reveals that both signals, token-probability and verbalized confidence, are protocol-dependent behavioral measurements. They aren't standalone indicators of a model's certainty. This understanding is essential for anyone working with LLMs, whether they're tuning models or evaluating their outputs.
A Call for Transparency
The research concludes with a call for transparency. It offers a reporting checklist covering elicitation provenance, scored answer, token-probability readout, and conditioning context. Wouldn't it be a step forward if all AI research adopted such transparency?
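One way to put such a checklist into practice is to attach it to every reported calibration number. The sketch below is a hypothetical structure built from the four items named above; the class name, field names, and example values are assumptions, not the paper's format.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class CalibrationProtocolReport:
    """The four checklist items named in the article, one field each."""
    elicitation_provenance: Optional[str] = None     # where the confidence prompt and scale come from
    scored_answer: Optional[str] = None              # which answer string is scored
    token_probability_readout: Optional[str] = None  # how the score is read from tokens
    conditioning_context: Optional[str] = None       # what context the model is conditioned on

    def missing_items(self):
        """Return the names of checklist items that were left unstated."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical, partially filled report.
report = CalibrationProtocolReport(
    scored_answer="generated answer",
    conditioning_context="bare context",
)
print(report.missing_items())
```

A check like `missing_items()` could run before results are published, so that an unstated protocol choice is flagged rather than silently omitted.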
As AI models continue to evolve, the methods we use to evaluate them must also adapt. Ignoring the nuances of measurement could lead to erroneous conclusions, affecting everything from deployment decisions to user trust.