Are Language Models Really Thinking? Not So Fast.
A closer look at linear probing shows that high accuracy in language models might just be smoke and mirrors. The devil's in the data formats, not the reasoning skills.
Linear probing of large language model hidden states has become a favored method for claiming that these models possess distinct reasoning capabilities. But how much of this is genuine insight and how much is just an artifact of the method?
The Promise of Probing
Let's take Qwen3-14B, a model subjected to scrutiny across three benchmarks: LogiQA 2.0 for deductive reasoning, ARC-Challenge for inductive reasoning, and αNLI for abductive reasoning. At layer 32 out of 40, linear probes boasted a 100% cross-validated accuracy, with the intrinsic dimensionalities showing elegant separation: 20.6, 28.5, and 33.6, respectively. Add to this a convex hull contamination of less than 1.5%, and the results seem conclusive, right?
Not quite. The claim doesn't survive scrutiny. The impressive separation is a mirage, driven by format confounds rather than by any distinct reasoning computation inside the model.
When Separation Fails
Once the hidden states are residualized for source identity, option count, and response length, probe accuracy plummets to chance level. In plain terms, when you strip away the superficial differences in data format, the model's supposed reasoning abilities evaporate. Trace-anchor similarity metrics further reveal only 42.5% agreement across tasks, barely above the 33.3% you'd get by guessing at random among three options.
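To make the residualization argument concrete, here is a minimal sketch on synthetic data, not the study's actual pipeline. Everything in it is an assumption made for illustration: the hidden states are simulated so that they encode only two format features (option count and response length), the probe is a simple ridge least-squares classifier, and the names `fit_linear_probe`, `cv_probe_accuracy`, and `residualize` are hypothetical helpers. Source identity is left out for brevity.

```python
import numpy as np

def fit_linear_probe(X, y, n_classes, ridge=1e-3):
    """Least-squares linear probe: ridge regression onto one-hot task labels."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    Y = np.eye(n_classes)[y]                    # one-hot targets
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def cv_probe_accuracy(X, y, n_classes, k=5, seed=0):
    """k-fold cross-validated accuracy of the linear probe."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    correct = 0
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        W = fit_linear_probe(X[train], y[train], n_classes)
        Xb = np.hstack([X[test], np.ones((len(test), 1))])
        correct += int(np.sum(np.argmax(Xb @ W, axis=1) == y[test]))
    return correct / len(X)

def residualize(X, C):
    """Remove everything in X that is linearly predictable from confounds C."""
    Cb = np.hstack([C, np.ones((len(C), 1))])   # confounds plus intercept
    beta, *_ = np.linalg.lstsq(Cb, X, rcond=None)
    return X - Cb @ beta

# Synthetic setup (assumed): three "tasks" that differ only in format.
rng = np.random.default_rng(0)
n_per_task, dim = 200, 16
y = np.repeat(np.arange(3), n_per_task)
option_count = np.array([2.0, 4.0, 5.0])[y]
response_len = np.array([50.0, 120.0, 40.0])[y] + rng.normal(0, 10, size=y.size)
C = np.column_stack([option_count, response_len])

# Fake hidden states: a linear embedding of the format features plus noise.
Cz = (C - C.mean(axis=0)) / C.std(axis=0)
H = Cz @ rng.normal(size=(2, dim)) + 0.5 * rng.normal(size=(y.size, dim))

acc_raw = cv_probe_accuracy(H, y, 3)
acc_res = cv_probe_accuracy(residualize(H, C), y, 3)
print(f"probe accuracy, raw hidden states:      {acc_raw:.3f}")
print(f"probe accuracy, after residualization:  {acc_res:.3f}")
```

In this toy setup the first number should come out high and the second close to one in three, because the simulated states contain nothing but format information. On real activations the interesting question is how much accuracy survives the same treatment.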
Color me skeptical, but when causal steering experiments with random controls (n=20) reveal no functional link between geometry and reasoning mode (p=0.286), it becomes clear that what these probes are actually measuring is task format, not cognitive competence.
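For readers wondering how a comparison against random controls turns into a p-value, one common approach is an empirical (permutation-style) estimate: count how many random steering directions produce an effect at least as large as the probe-derived direction. The piece does not say how its p-value was computed, so the sketch below is an assumption, and the effect sizes in it are made-up numbers chosen only for illustration.

```python
def empirical_p_value(observed, controls):
    """One-sided empirical p-value with the usual +1 correction.

    observed: effect size for the direction under test
    controls: effect sizes for random control directions
    """
    at_least_as_large = sum(1 for c in controls if c >= observed)
    return (1 + at_least_as_large) / (1 + len(controls))

# Hypothetical effect sizes for 20 random control directions: 0.00, 0.02, ..., 0.38
controls = [i / 100 for i in range(0, 40, 2)]
observed = 0.29  # hypothetical effect of steering along the probe direction

p_value = empirical_p_value(observed, controls)
print(f"{p_value:.3f}")
```

Under this estimator, 5 of 20 controls matching or beating the observed effect gives 6/21, or about 0.286. That happens to equal the reported figure, but it is not confirmation that the study used this method.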
Implications for AI Interpretability
So, why does this matter? If you're in the business of AI, it's essential to understand that high probe accuracy isn't the holy grail you've been sold. It's a stark reminder that mechanistic interpretability requires more than superficial metrics of success. It's time for the field to adopt routine format deconfounding, so that cherry-picked conclusions stop misleading rather than enlightening.
The allure of being able to claim that our models 'think' in human-like ways is strong. But I've seen this pattern before: hype outpaces the science, leaving practitioners scrambling to catch up when reality sets in.
In a world hungry for AI that's not just efficient but trustworthy, we can't afford to rest on our laurels. Let's apply some rigor here and ensure that our claimed breakthroughs are genuine, not just smoke and mirrors.