Can AI Really Think Like Us? The Struggle with Human-Like Decisions
Large language models impress with generative ability, yet struggle with nuanced judgment calls. Supervised fine-tuning shows promise in mimicking human reasoning.
Large language models (LLMs) were initially the darlings of generative AI. But now, the buzz is about their evolution into agentic AI systems. These systems can make decisions in complex real-world settings. However, while their generative capabilities have dazzled us, their decision-making chops lag behind. They often stumble over exceptions, a key part of any nuanced decision-making process.
Where Do LLMs Trip Up?
LLMs, even those that shine in reasoning tasks, tend to stick rigidly to set policies. This rigidity can lead to choices that are impractical or downright counterproductive. It's not surprising if you've ever fought with a chatbot that just didn't get your situation. The real story here is that these models struggle with exceptions because contracts and rules are inherently incomplete: no policy can spell out every situation in advance.
The team behind recent research tried three techniques to help AI handle exceptions: ethical framework prompting, chain-of-thought reasoning, and supervised fine-tuning. Spoiler alert: ethical prompting didn't work. Chain-of-thought reasoning offered slight improvement. But it was supervised fine-tuning that stole the show. Especially when models were tuned using human explanations, not just yes-or-no labels.
The Power of Supervised Fine-Tuning
What really stood out was how supervised fine-tuning allowed models to generalize human-like decision-making to brand new situations. This isn't just transfer learning. It's about aligning AI decisions with human judgment across different contexts. Sure, the numbers back this up, but the implications for AI development are what's really fascinating.
If AI can start to make decisions that feel human, the potential is enormous. But let's not get ahead of ourselves. The models still need significant work to handle exceptions as fluidly as a human would, and the metrics showing where AI struggles tell us more than the hype about where we need to focus.
Why Should We Care?
So, why does this matter? The future of AI hinges not just on generative capabilities but on decision-making skills that can keep up with human intuition. Without this, we're left with machines that can produce text but not truly understand it. What matters is whether anyone's actually using this in real-world scenarios where judgment counts.
Here's a thought: if supervised fine-tuning with human explanations can really bridge the gap, should we be investing more in this area? The grind of making AI more human-like isn't just about better algorithms. It's about real-world usability: building AI solutions that people actually keep using in everyday applications.
As AI continues to evolve, let's not lose sight of the fundamental question: can these systems make decisions like us, or are they forever doomed to be mere calculators with a knack for language? The jury's still out, but the signs are promising.
Key Terms Explained
Agentic AI refers to AI systems that can autonomously plan, execute multi-step tasks, use tools, and make decisions with minimal human oversight.
A chatbot is an AI system designed to have conversations with humans through text or voice.
Fine-tuning is the process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
Generative AI refers to AI systems that create new content — text, images, audio, video, or code — rather than just analyzing or classifying existing data.