Are AI Models Flunking Real-World Policy Tests?

Large language models, or LLMs, are the new darlings of the tech world, increasingly hailed for their prowess in reasoning tasks. But here's the kicker: when it comes to evaluating real-world policy decisions, these models might not be as reliable as you'd hope.

Testing AI in the Real World

A recent study created a benchmark of 40 policy evaluation cases. These aren't just any cases. They're grounded in peer-reviewed research from economics and social science. And each case is classified based on how intuitive its outcomes are. Some are obvious. Others are ambiguous or downright counter-intuitive.

Four advanced LLMs were put to the test across five prompting strategies. With 8,000 experimental trials, the study aimed to see how well these models handle policy evaluations. The results? A mixed bag.
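The trial count is easy to sanity-check. The sketch below assumes a fully crossed, balanced design, meaning every model saw every case under every prompting strategy the same number of times. The summary above doesn't state that outright, so treat the result as an inference rather than a reported figure.

```python
# Figures reported in the study summary.
n_cases = 40
n_models = 4
n_strategies = 5
total_trials = 8_000

# Assumption (not stated in the summary): the design is fully crossed
# and balanced, so each case/model/strategy combination is run equally often.
unique_combinations = n_cases * n_models * n_strategies
repetitions, remainder = divmod(total_trials, unique_combinations)

print(f"unique combinations: {unique_combinations}")
print(f"implied repetitions per combination: {repetitions}")
print(f"trials left over: {remainder}")
```

Under that assumption, the numbers divide evenly, which suggests each combination was run multiple times to average out the randomness in model outputs.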

The Chain-of-Thought Paradox

One striking finding was the "chain-of-thought paradox." When LLMs were given clear, step-by-step reasoning prompts, their performance shot up on obvious cases. But on counter-intuitive ones, this advantage fizzled. The odds ratio? Just 0.278, with a p-value less than 0.001. An odds ratio below 1 means lower odds of a correct answer, so this is a large and statistically significant drop. Essentially, when outcomes defy common sense, even the best reasoning prompts can't save the day.
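To get a feel for what an odds ratio of 0.278 does, it helps to convert it into probabilities. The sketch below is illustrative only: the baseline accuracy values are hypothetical and do not come from the study, and the function name is made up for this example. It simply scales the baseline odds of a correct answer by the reported ratio.

```python
def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Return the probability obtained by scaling the baseline odds by odds_ratio."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


REPORTED_ODDS_RATIO = 0.278  # figure reported in the study

# Hypothetical baseline accuracies, chosen only for illustration.
for baseline in (0.5, 0.7, 0.9):
    adjusted = apply_odds_ratio(baseline, REPORTED_ODDS_RATIO)
    print(f"baseline accuracy {baseline:.0%} -> adjusted accuracy {adjusted:.1%}")
```

The point to take away is that odds ratios are not percentage changes: the same ratio produces a different drop in accuracy depending on where the baseline sits.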

Intuition Trumps Tech

Another revelation? Intuition is king. The variance in case outcomes was more about the intuitiveness of the cases than which model or strategy was used. In statistical terms, the intra-class correlation coefficient was 0.671, meaning roughly two-thirds of the variance in outcomes traced back to the cases themselves. What does this mean? Models struggle when faced with cases that challenge intuitive beliefs. It's less about the technology and more about how intuitive the answer looks.
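The intra-class correlation has a simple definition: the share of total variance that sits between groups (here, between cases). The sketch below uses that textbook formula with made-up function names. It does not reproduce the study's actual model, which the summary doesn't describe, and it only shows what ratio of variances an ICC of 0.671 implies.

```python
def intraclass_correlation(between_var: float, within_var: float) -> float:
    """Share of total variance attributable to differences between groups."""
    total = between_var + within_var
    if total <= 0:
        raise ValueError("total variance must be positive")
    return between_var / total


def implied_variance_ratio(icc: float) -> float:
    """Ratio of between-group to within-group variance implied by an ICC."""
    if not 0.0 <= icc < 1.0:
        raise ValueError("icc must be in [0, 1)")
    return icc / (1.0 - icc)


REPORTED_ICC = 0.671  # figure reported in the study

ratio = implied_variance_ratio(REPORTED_ICC)
print(f"between-case variance is about {ratio:.2f}x the remaining variance")
```

In other words, under this definition the differences between cases account for about twice as much variation as everything else combined, including the choice of model and prompting strategy.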

And here's a head-scratcher: models' familiarity with the content, as measured by citation-based data, didn't correlate with accuracy. Despite knowing the facts, models couldn't apply them effectively when intuition was contradicted.

What Does This Mean?

These results highlight a disconnect between knowledge and reasoning in AI models, echoing dual-process theory, which separates fast, intuitive thinking from slow, deliberate reasoning. While models can exhibit "slow thinking," they falter at overcoming intuitive biases. So, are we overestimating the ability of AI to tackle real-world complexities?

This raises a critical question: Can AI ever fully replace the nuanced, intuitive understanding humans bring to policy evaluations? Until AI models can bridge the gap between knowledge and reasoning, perhaps we shouldn't rely on them as standalone decision-makers in complex policy arenas.

In the end, the study warns us not to get too comfortable with AI's flashy capabilities. They're impressive, sure, but when the chips are down and intuition is challenged, the limits become clear.