Why Large Language Models Falter in Causal Discovery

Causal discovery is a critical component of scientific inquiry, yet there's a striking gap in how well large language models can perform this task. Despite the hype surrounding these models, recent evaluations show a troubling plateau in their ability to handle even basic causal graphs. And as complexity ramps up, their performance nosedives. But why?

The Fundamental Limitation

It turns out the issue isn't with the models themselves or the datasets they're fed. The problem is woven into the very fabric of their learning methodologies. Supervised fine-tuning, direct preference optimization, and in-context learning all falter because they can't differentiate between causal graphs that generate similar observational data. It's like trying to distinguish between identical twins based solely on their shadows. The crux of the matter is that for these methods to succeed, the models' internal representations would need to grow without bound, a scenario that's simply not feasible.
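To make the "identical twins" point concrete, here is a minimal sketch of the limiting case, where two graphs produce not just similar but identical observational data. It assumes a two-variable linear Gaussian model, which is an illustrative choice and not something taken from the evaluations discussed here. The function names and parameters are hypothetical.

```python
def cov_x_causes_y(a, noise_var):
    """Observational (var_x, cov_xy, var_y) for X ~ N(0, 1), Y = a*X + noise."""
    var_x = 1.0
    cov_xy = a * var_x
    var_y = a * a * var_x + noise_var
    return var_x, cov_xy, var_y


def cov_y_causes_x(a, noise_var):
    """Observational (var_x, cov_xy, var_y) for a reversed model, Y -> X,
    with parameters chosen to mimic the X -> Y model above."""
    var_y = a * a + noise_var      # Y is now the root variable
    b = a / var_y                  # coefficient in X = b*Y + residual
    resid_var = noise_var / var_y  # variance of that residual
    var_x = b * b * var_y + resid_var
    cov_xy = b * var_y
    return var_x, cov_xy, var_y


def mean_y_under_do_x(graph, a, x_value):
    """E[Y | do(X = x_value)]: only the X -> Y graph transmits the intervention."""
    if graph == "X->Y":
        return a * x_value
    return 0.0  # in Y -> X, setting X by hand leaves Y at its mean of zero


if __name__ == "__main__":
    a, noise_var = 0.8, 0.5
    print("X->Y observational:", cov_x_causes_y(a, noise_var))
    print("Y->X observational:", cov_y_causes_x(a, noise_var))
    print("do(X=2) under X->Y:", mean_y_under_do_x("X->Y", a, 2.0))
    print("do(X=2) under Y->X:", mean_y_under_do_x("Y->X", a, 2.0))
```

The two covariance triples agree up to floating-point rounding, so no amount of passive data separates the graphs in this toy setting. Only the answer to the intervention question differs, which is exactly the kind of information observational training signals don't carry.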

Introducing Agentic Causal Bayesian Optimization

Enter Agentic Causal Bayesian Optimization, or A-CBO. This approach sidesteps the intrinsic limitations by moving the search outside the model. Instead of altering the underlying model, A-CBO uses a frozen language model as an interventional oracle. The model answers targeted queries about intervention effects, while an external Bayesian loop narrows down the candidate graphs in logarithmically few rounds.
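How can a handful of yes/no answers pin down a graph? The toy sketch below is not the A-CBO algorithm, whose details aren't given here. It is a noiseless elimination loop under loud assumptions: three variables, a uniform prior over all DAGs, and a stub oracle that stands in for the frozen language model by reading from a known ground-truth graph. Every name in it (`make_stub_oracle`, `discover`, and so on) is hypothetical.

```python
from itertools import permutations, product


def is_acyclic(edges, nodes):
    """Repeatedly peel off nodes with no incoming edge from the remaining set."""
    remaining = set(nodes)
    while remaining:
        sources = [n for n in remaining
                   if not any(v == n and u in remaining for u, v in edges)]
        if not sources:
            return False  # every remaining node has a parent: there is a cycle
        remaining -= set(sources)
    return True


def all_dags(nodes):
    """Enumerate every DAG over `nodes` as a frozenset of (cause, effect) edges."""
    pairs = list(permutations(nodes, 2))
    dags = []
    for mask in product([False, True], repeat=len(pairs)):
        edges = frozenset(p for p, keep in zip(pairs, mask) if keep)
        if is_acyclic(edges, nodes):
            dags.append(edges)
    return dags


def make_stub_oracle(true_edges):
    """Assumption: stands in for a frozen LLM asked 'does intervening on
    `cause`, with everything else held fixed, change `effect`?'"""
    def oracle(cause, effect):
        return (cause, effect) in true_edges
    return oracle


def discover(nodes, oracle):
    """Ask the most evenly splitting question, drop inconsistent graphs, repeat."""
    candidates = all_dags(nodes)
    queries = list(permutations(nodes, 2))
    rounds = 0
    while len(candidates) > 1:
        def imbalance(q):
            yes = sum(q in g for g in candidates)
            return abs(2 * yes - len(candidates))
        query = min(queries, key=imbalance)  # closest to a 50/50 split
        answer = oracle(*query)
        candidates = [g for g in candidates if (query in g) == answer]
        queries.remove(query)
        rounds += 1
    return candidates[0], rounds


if __name__ == "__main__":
    truth = frozenset({("A", "B"), ("B", "C")})
    graph, rounds = discover(["A", "B", "C"], make_stub_oracle(truth))
    print(sorted(graph), "found in", rounds, "rounds")
```

Each well-chosen question discards roughly half of the surviving candidates, which is where the logarithmic flavor comes from. A real oracle would be noisy, so a proper Bayesian loop would reweight candidates rather than delete them outright.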

On the Corr2Cause benchmark, A-CBO matches its fine-tuned counterparts without any additional training. More impressively, on the Extended Corr2Cause benchmark, which scales up to 24 variables with a whopping 18,000 test samples, A-CBO doesn't just compete, it dominates. The gap between A-CBO and traditional methods grows as the complexity increases. This isn't just a minor improvement; it's a seismic shift in approach.

Why Should We Care?

So, why does this matter? The ability to accurately discern causal relationships is key in fields ranging from epidemiology to economics. If we can't trust large language models to perform reliable causal discovery, where does that leave us? Imagine the impact on policy-making or clinical trials if our models can't even get the basics right.

Color me skeptical, but the current fanfare around large language models needs a reality check. They're undoubtedly powerful, but their limitations are becoming glaringly apparent as we push the boundaries of what's expected from them. What they're not telling you: without inventive methodologies like A-CBO, these models risk becoming more of a technological curiosity than a tool of genuine scientific advancement.
