Are AI's Reasoning Steps Just Window Dressing?
A recent study suggests most AI models' step-by-step reasoning is decorative rather than functional. Two models show genuine reasoning dependence, but the broader industry needs a shift in evaluation methods.
Language models today routinely 'show their work,' offering step-by-step reasoning before providing an answer. Yet a key question arises: are these steps genuinely part of the decision-making process, or are they decorative narratives generated after the decision has already been made?
Decorative Reasoning in AI
Consider this scenario: a medical AI suggests a diagnosis of cholesterol embolization syndrome, citing symptoms like eosinophilia. Now, if we remove eosinophilia from the equation, does the diagnosis change? For most frontier models, the answer is no. The step is merely decorative.
Recent research introduces a method known as step-level evaluation. By removing one reasoning sentence at a time, we can check if the model's answer changes. This method is surprisingly cost-effective, requiring only API access and costing about $1-2 per model per task.
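The idea can be sketched in a few lines. This is a minimal illustration, not the study's actual harness: `query_model` is a hypothetical stand-in for any chat-completions API call, and the prompt format is an assumption.

```python
# Sketch of step-level evaluation: ablate one reasoning sentence
# at a time, re-query the model, and check whether the final answer
# changes. `query_model` is a hypothetical API wrapper.

def necessity_rate(question, steps, answer, query_model):
    """Fraction of reasoning steps whose removal flips the answer."""
    changed = 0
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]  # drop step i
        prompt = (
            f"{question}\n"
            "Reasoning so far:\n" + "\n".join(ablated) + "\n"
            "Final answer:"
        )
        if query_model(prompt).strip() != answer:
            changed += 1  # the answer moved, so step i was necessary
    return changed / len(steps) if steps else 0.0
```

A necessity rate near zero means the steps are decorative: the model reaches the same answer no matter which one you delete.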
The Numbers Tell the Story
Testing ten frontier models, including GPT-5.4 and Claude Opus, across various tasks reveals a startling pattern. For most models, removing any step changes the answer less than 17% of the time, indicating these steps are often superficial. Even in math tasks, where smaller models (0.8B-8B parameters) show genuine step dependence, the necessity rate peaks at 55%.
Notably, two models defied this trend: MiniMax-M2.5 on sentiment analysis showed 37% necessity, while Kimi-K2.5 on topic classification reached 39%. However, both models took shortcuts on other tasks. The takeaway? Faithfulness in reasoning is both model-specific and task-specific.
Output Rigidity and Its Implications
Another intriguing discovery is 'output rigidity.' On identical medical questions, Claude Opus produced 11 diagnostic steps, whereas GPT-OSS-120B delivered just a single token. Mechanistic analysis of attention patterns added support: attention to the reasoning steps drops more sharply in later layers on decorative tasks (33%) than on faithful ones (20%).
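To make the attention comparison concrete, here is an illustrative sketch of how such a layer-wise drop could be measured. This is an assumption about the setup, not the paper's exact method: it takes a per-layer attention tensor (as returned by many open-weight transformer implementations) and compares the attention mass the answer position places on reasoning-step tokens in early versus late layers.

```python
import numpy as np

# Illustrative metric: how much does attention from the answer position
# onto reasoning-step tokens fall off in the later half of the network?
# `attn` has shape (layers, heads, query_len, key_len); the shape and
# the early/late split are assumptions for this sketch.

def late_layer_attention_drop(attn, step_token_ids, answer_pos):
    n_layers = attn.shape[0]
    # Attention mass from the answer position onto step tokens,
    # summed over step tokens and averaged over heads, per layer.
    mass = attn[:, :, answer_pos, step_token_ids].sum(-1).mean(-1)
    early = mass[: n_layers // 2].mean()
    late = mass[n_layers // 2 :].mean()
    return (early - late) / early  # fractional drop in later layers
```

On decorative tasks, the study's numbers correspond to a larger drop by this kind of measure: the model stops attending to its own stated steps before committing to an answer.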
These findings suggest that step-by-step explanations from frontier models are largely decorative, and they should prompt a reevaluation of how we assess AI reasoning. The fix is not scaling up models but refining training objectives so that stated reasoning is genuinely load-bearing.
Why should this matter to readers? If AI's reasoning isn't genuinely integrated into its decision-making process, how can we trust its conclusions? With AI playing an increasingly prominent role in critical fields, transparency and genuine reasoning aren't just desirable but essential.
Key Terms Explained
Attention: A mechanism that lets neural networks focus on the most relevant parts of their input when producing output.
Classification: A machine learning task where the model assigns input data to predefined categories.
Claude: Anthropic's family of AI assistants, including Claude Haiku, Sonnet, and Opus.
Evaluation: The process of measuring how well an AI model performs on its intended task.