Can LLMs Really Reason? New Study Challenges Their 'Slow Thinking'
A recent study put large language models to the test in policy evaluation, revealing a stark disconnect between knowledge and reasoning. The results challenge the models' ability to truly think through complex, counter-intuitive scenarios.
Large language models (LLMs) are the rock stars of the AI world. Everyone's talking about their ability to perform complex tasks, from natural language processing to advanced reasoning. But can they really think through complex problems, especially real-world policy questions? That's the question a recent study set out to answer.
The Experiment
Researchers constructed a benchmark of 40 empirical policy cases drawn from economics and social science. These cases weren't random picks: each was backed by peer-reviewed evidence. The twist? They were classified by how intuitive their findings were, into obvious, ambiguous, and counter-intuitive cases. The study then ran four recent LLMs through more than 2,400 trials using five different prompting strategies.
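To make that setup concrete, here's a minimal sketch of what such an evaluation harness could look like. Everything in it is illustrative: the PolicyCase fields, the prompt templates, and the model.generate client call are assumptions for the sake of the example, not the study's actual code.

```python
from dataclasses import dataclass

@dataclass
class PolicyCase:
    question: str        # e.g. a policy question with an evidence-backed answer
    ground_truth: str    # conclusion supported by the peer-reviewed evidence
    intuitiveness: str   # "obvious", "ambiguous", or "counter-intuitive"

# Hypothetical prompt strategies; the study's exact templates are not reproduced here.
PROMPT_STRATEGIES = {
    "direct": "State the effect you expect: {question}",
    "chain_of_thought": "Think step by step, then state the effect: {question}",
    # ... three more strategies would round out the five used in the study
}

def run_benchmark(cases, models, n_repeats=3):
    """Cross every case with every model and prompt strategy, repeating each trial."""
    results = []
    for case in cases:
        for model in models:
            for name, template in PROMPT_STRATEGIES.items():
                for _ in range(n_repeats):
                    prompt = template.format(question=case.question)
                    answer = model.generate(prompt)  # assumed client API
                    results.append({
                        "intuitiveness": case.intuitiveness,
                        "model": model.name,         # assumed attribute
                        "strategy": name,
                        "correct": answer.strip() == case.ground_truth,
                    })
    return results
```

Crossing 40 cases with 4 models, 5 strategies, and a few repeats per cell lands right around the 2,400-trial scale the study reports.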
Here's where it gets interesting. The results revealed what the researchers dubbed the 'chain-of-thought paradox': when LLMs were prompted to walk through their reasoning step by step, they excelled at the obvious cases. But throw them a counter-intuitive scenario, and that advantage all but vanished.
Intuition vs. Computation
Now, here's something to chew on: the variance in performance had more to do with the intuitiveness of the cases than with the choice of model or even the prompting strategy. In other words, how 'obvious' a question seems can skew results more than either of the knobs we usually tune. In numbers, case intuitiveness accounted for more than half of the variance, with an intra-class correlation (ICC) of 0.537.
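For readers who want to see what that ICC figure summarizes, here's a minimal sketch of a one-way, single-measure ICC computed from trial scores grouped by intuitiveness category. It assumes a balanced design with per-trial accuracy scores; the study's actual analysis may use a different estimator, so treat this as an illustration of the idea rather than a reproduction.

```python
import numpy as np

def icc1(groups):
    """One-way random-effects ICC(1): share of variance explained by group membership.

    `groups` is a list of equal-sized sequences of per-trial scores,
    one sequence per intuitiveness category (assumed balanced design).
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                # number of groups
    n = len(groups[0])             # observations per group
    grand_mean = np.mean(np.concatenate(groups))

    # Between-group and within-group mean squares from one-way ANOVA
    ms_between = n * sum((g.mean() - grand_mean) ** 2 for g in groups) / (k - 1)
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (k * (n - 1))

    return (ms_between - ms_within) / (ms_between + (n - 1) * ms_within)
```

Read this way, an ICC of 0.537 means a bit more than half of the score variance sits between the intuitiveness categories rather than within them.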
Think of it this way: models have the data, the citations, the familiarity, but when they need to go against the grain, their accuracy plummets. It's like having a library of knowledge but not knowing which book to open when your intuition says something different. This dissociation between knowledge and reasoning is a critical flaw if we're counting on these models for policy evaluation.
Are We Just Talking Slow?
The analogy I keep coming back to is dual-process theory: System 1 vs. System 2 thinking. System 1 is fast, intuitive, your gut feeling. System 2 is slower, more deliberate. The study suggests that what we see in LLMs as 'slow thinking' might just be 'slow talking'. They mimic deliberative reasoning without the substance.
Why does this matter? Well, if you've ever trained a model, you know the promise of AI isn't just in regurgitating data. It's in making leaps, drawing insights, and challenging the obvious. If current LLMs can't handle counter-intuitive reasoning, are we overestimating their capabilities?
Here's why this matters for everyone, not just researchers: one day, these models might help shape policies affecting healthcare, education, and more. If they can't reason through the unexpected, do we really want them in the driver's seat?
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Evaluation: The process of measuring how well an AI model performs on its intended task.
Natural language processing (NLP): The field of AI focused on enabling computers to understand, interpret, and generate human language.
Prompt: The text input you give to an AI model to direct its behavior.