Debunking LLMs' Reasoning Insights: The Format Illusion

Recent evaluations of the Qwen3-14B language model cast doubt on the assumption that such models genuinely differentiate between deductive, inductive, and abductive reasoning. The paper reporting them, published in Japanese, finds that linear probes at layer 32 of 40 in Qwen3-14B reach a seemingly impressive 100% cross-validated accuracy on benchmarks such as LogiQA 2.0, ARC-Challenge, and αNLI, but that this accuracy is misleading.

The Role of Format Confounds

What the English-language press missed: the reasoning types themselves aren't what drives the probes' accuracy. Instead, the clean separation is largely due to format confounds such as source identity, option count, and response length. Once these factors are accounted for, accuracy falls to chance level. This raises a key question: are these language models genuinely distinguishing kinds of reasoning, or merely exposing superficial cues for a probe to exploit?
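To see how a format confound alone can produce perfect separation, consider a minimal sketch. It is not the paper's procedure, which the coverage here does not describe. It computes the best accuracy any classifier could reach if it saw only a format key (here, the source benchmark). The pairing of each benchmark with one reasoning type, the item counts, and the function name `format_only_ceiling` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def format_only_ceiling(items):
    """Best accuracy reachable by a classifier that sees only the format key.

    items: list of (format_key, label) pairs.
    For each format key the best guess is its most common label, so the
    ceiling is the sum of per-key majority counts divided by the total.
    """
    by_format = defaultdict(Counter)
    for format_key, label in items:
        by_format[format_key][label] += 1
    majority_hits = sum(max(counts.values()) for counts in by_format.values())
    return majority_hits / len(items)


# Hypothetical confounded pool: each reasoning type comes from exactly one
# source. The source-to-label pairing is an assumption, not from the paper.
confounded = (
    [("LogiQA 2.0", "deductive")] * 10
    + [("ARC-Challenge", "inductive")] * 10
    + [("αNLI", "abductive")] * 10
)

# Hypothetical deconfounded pool: every source contributes all three
# reasoning types in equal numbers, so the source carries no label signal.
sources = ["LogiQA 2.0", "ARC-Challenge", "αNLI"]
labels = ["deductive", "inductive", "abductive"]
balanced = [(s, l) for s in sources for l in labels for _ in range(5)]

confounded_ceiling = format_only_ceiling(confounded)
balanced_ceiling = format_only_ceiling(balanced)

print(f"format-only ceiling, confounded pool: {confounded_ceiling:.3f}")
print(f"format-only ceiling, balanced pool:   {balanced_ceiling:.3f}")
```

In the confounded pool, knowing the source is enough to name the reasoning type every time, so a probe's perfect score says nothing about reasoning. In the balanced pool, the same format information is worth no more than a three-way guess.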

Shared Reasoning, Not Distinct Paths

Trace-anchor similarity measures show 42.5% agreement across reasoning tasks, against a chance level of 33.3%, the rate expected from a three-way guess. This suggests that the model shares a common approach across tasks rather than adopting a distinct reasoning mode for each. Causal steering experiments with 20 random controls support this: they find no statistically significant functional link between the geometry of the model's representations and reasoning mode (p = 0.286).
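The article does not say how the p-value was computed. One common convention for comparing a real intervention against random controls is the empirical p-value: count the controls whose effect is at least as large as the observed one, add one to both the count and the total, and divide. The sketch below uses that convention with invented effect sizes. Under it, 5 of 20 controls matching or beating the real steering direction gives 6/21, which rounds to 0.286. That this matches the reported figure is suggestive only, not confirmation of the paper's method.

```python
def empirical_p_value(observed, controls):
    """One-sided empirical p-value against random controls.

    Counts controls with an effect at least as large as the observed one,
    then applies the (count + 1) / (n + 1) correction so the real run is
    included in the reference set.
    """
    at_least_as_large = sum(1 for c in controls if c >= observed)
    return (at_least_as_large + 1) / (len(controls) + 1)


# Hypothetical effect sizes, invented for illustration only.
observed_effect = 0.50
random_controls = [
    0.10, 0.62, 0.35, 0.48, 0.71, 0.22, 0.55, 0.41, 0.30, 0.58,
    0.15, 0.44, 0.27, 0.66, 0.39, 0.12, 0.47, 0.33, 0.25, 0.19,
]

p = empirical_p_value(observed_effect, random_controls)
print(f"empirical p-value: {p:.3f}")
```

With only 20 controls, the smallest p-value this convention can produce is 1/21, roughly 0.048, so a result of this kind can only barely clear a 0.05 threshold even in the best case.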

Why This Matters

Many in the AI community have touted the ability of large language models to mimic human-like reasoning. However, these findings suggest that we may be overestimating their capabilities. If task format, rather than true reasoning ability, is driving model performance, then the reliability of such models in critical applications must be questioned. Shouldn't we demand more rigorous evaluations that separate genuine intelligence from statistical trickery?

The data shows that high probe accuracy reflects little more than task format. If we're serious about advancing AI, routine format deconfounding should be a priority in mechanistic interpretability. Without it, we risk building systems that are as shallow as their training data allows. Western coverage has largely overlooked this key analysis, but the implications for AI development can't be ignored.
