Evaluating AI Faithfulness: Why Consistency is Elusive
A new study reveals significant discrepancies in AI faithfulness depending on the classifier used. This variability questions the objectivity of current faithfulness metrics.
How reliable are the metrics we use to gauge AI models' faithfulness? A recent study finds that faithfulness scores vary substantially depending on which classifier does the scoring, challenging the notion that faithfulness is an objective, measurable trait.
The Faithfulness Discrepancy
Researchers applied three different classifiers to 10,276 reasoning traces from 12 open-weight AI models ranging from 7 billion to 1 trillion parameters. The classifiers were a regex-only detector, a regex-plus-LLM pipeline, and a Claude Sonnet 4 judge. The resulting faithfulness rates varied dramatically: 74.4%, 82.6%, and 69.7% respectively.
These discrepancies aren't trivial. Per-model gaps ranged from 2.6 to a staggering 30.6 percentage points, and all pairwise McNemar tests were statistically significant (p < 0.001). The choice of classifier can even flip model rankings: Qwen3.5-27B ranked first under the pipeline but dropped to seventh under Sonnet, while OLMo-3.1-32B moved from ninth to third.
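McNemar's test is the natural choice here because the classifiers judged the same traces: it ignores the cases where two classifiers agree and asks whether the disagreements lean significantly in one direction. A minimal sketch of the chi-square version (the counts below are hypothetical; the study reports only that all pairwise tests came in under p = 0.001):

```python
import math

def mcnemar_chi2(b: int, c: int) -> tuple[float, float]:
    """McNemar test on the discordant cells of a paired 2x2 table.

    b: traces classifier A calls faithful but classifier B calls unfaithful
    c: the reverse case. Returns (chi2 statistic, two-sided p-value).
    """
    stat = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom:
    # P(X > stat) = erfc(sqrt(stat / 2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Hypothetical discordant counts for two classifiers
stat, p = mcnemar_chi2(150, 90)
print(f"chi2 = {stat:.2f}, p = {p:.4g}")
```

With heavily lopsided disagreement, as the study describes, the statistic grows quickly and the p-value collapses, which is why even a modest-looking gap in headline rates can be highly significant.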
Systematic Disagreements
These disagreements weren't random. Cohen's kappa, a chance-corrected measure of inter-rater agreement, ranged from a mere 0.06 for sycophancy hints to 0.42 for grader hints. The asymmetry was stark: for sycophancy, 883 cases the pipeline labeled faithful were deemed unfaithful by the Sonnet judge, against only two cases in the reverse direction.
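Cohen's kappa discounts the agreement two raters would reach by chance alone, which is why it can sit near zero even when raw agreement looks respectable. A self-contained sketch for two binary classifiers (the labels below are illustrative, not the study's data):

```python
def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Cohen's kappa for two binary raters (1 = faithful, 0 = unfaithful)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    p_a1 = sum(a) / n                                  # rater A's base rate
    p_b1 = sum(b) / n                                  # rater B's base rate
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)        # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Illustrative labels: the raters agree on 3 of 4 traces
print(cohens_kappa([1, 1, 1, 0], [1, 0, 1, 0]))  # 0.5
```

A kappa of 0.06, as reported for sycophancy hints, means the two classifiers agree barely more often than chance would predict.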
Such findings suggest that different classifiers interpret faithfulness with varying strictness: some check for lexical mention (does the trace name the hint?), while others check for epistemic dependence (does the reasoning actually rely on it?). This is more than an academic curiosity. It raises a key question: can we trust faithfulness numbers published across varied studies?
A Call for Better Evaluation Methods
When different classifiers are in play, the numbers tell different stories. In practice, this means comparing faithfulness scores from studies that used different classifiers is like comparing apples to oranges. If we're serious about accurately assessing AI models' faithfulness, we need a standardized evaluation framework that accounts for these discrepancies.
Here's the takeaway: until the AI community adopts more consistent methods, faithfulness metrics will remain unreliable. That's a problem for anyone relying on these models for decision-making. Let's insist on rigorous, multi-method evaluations to ensure that what we measure is truly indicative of an AI model's behavior.