Are Language Models Faking It Data Integrity?
Language models like Claude and OLMo can spot fake stats but fail in multi-source synthesis, treating all numbers equally. Epistemic alignment is the issue.
Language models have become essential tools, acting as our digital epistemic proxies. But here's the kicker: when tasked with synthesizing information from multiple sources, they can't seem to tell fake statistics from real ones. Whether it's Claude, Qwen, or OLMo, these models stumble at distinguishing valid data in complex settings, even though they can identify fabricated statistics in isolation with near-perfect accuracy. The numbers don't lie, identification rates range from 0.76 to 1.00 when methodology stands alone. Yet, this capability falters when juggling multiple data feeds.
The Fault in Our Models
The real issue lies in what researchers are calling a 'methodology-register gate.' It's a mechanism that assesses the style and presentation of analytical text, not the validity of its numeric content. Statistically implausible confidence intervals are given the same weight as valid ones. That's a big oversight. Who's to blame if a model equates statistical noise with signal?
What's more, this problem isn't confined to one type of language model. It's prevalent across three distinct families and professional domains. Causal tracing and linear probes show a consistent pattern: while models encode and use a methodology-register representation across domains, numeric-validity signals drop to chance levels during multi-source synthesis.
Mitigations That Miss the Mark
Attempts to fix this through prompting-based mitigations, even using an oracle checklist for statistical checks, have been futile. Instead of discerning better, models adopt blanket skepticism. Post-training pipelines reinforce these stylistic shortcuts, failing to include numeric verification. The issue isn't as simple as user preference tracking, it's about whether a source appears analytically credible, not whether its claims hold water.
This entire episode raises an uncomfortable question: if these models can't discern truth from fiction in multi-source contexts, what are we really paying for? Slapping a model on a GPU rental isn't a convergence thesis. The intersection is real. Ninety percent of the projects aren't. It's high time we ask if these models are truly aligning with our epistemic needs, or just mimicking them.
Get AI news in your inbox
Daily digest of what matters in AI.