AI Agents in Science: A Double-Edged Sword

The rise of large language models (LLMs) in scientific research presents a paradox. On one hand, these AI agents could diminish methodological variety. On the other, they might increase the flexibility researchers have to reach desired conclusions. Two distinct layers are at stake: a design layer, where methodological choices are made, and a verdict layer, where conclusions are drawn.

AI vs. Human: The Methodological Showdown

In a head-to-head comparison with human analysts, AI agents such as Codex and Claude Code were tested on immigration and social-policy analyses. Codex matched humans in methodological diversity, while Claude Code outpaced them, producing nearly three times as many specifications. The agents' diversity was therefore no lower than the humans', but with a twist: their effect estimates were broadly in line with the human consensus, yet no agent exactly replicated any human model.

But let's not rush to crown AI as the ultimate analyst. When a prompt suggested a bias akin to an anti-immigration stance, humans showed a shift in estimates. The AI, however, didn't budge in its aggregate estimates or final verdicts, though it reorganized its internal decision-making. Humans veer while the AI keeps its course, but that doesn't make the AI flawless.

The Verdict Layer: Where AI Falters

Here's where things get murky. At the verdict layer, AI's vulnerability emerges. An explicit confirmatory prompt flipped Claude Code's verdicts from 10% support to a stark 90% support. Yet its coefficient distribution remained largely unchanged. The shift came from omitting decision rules, not from easing them. What does this suggest? While AI can rival or exceed human diversity in design, interpretation remains its weak link: bias enters through how results are read, not through how they are estimated.
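To make the difference between omitting a rule and easing one concrete, here is a minimal Python sketch. Everything in it is hypothetical: the ten estimates are made-up illustrative values, not results from the study, and the three verdict rules (sign, significance, robustness) and the names `Estimate`, `support_share`, and `significant_at` are assumptions chosen for illustration, not the agents' actual decision procedure. The point is only that the same fixed set of coefficients can yield very different support rates depending on which rules the verdict applies.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Estimate:
    coef: float     # estimated effect (illustrative value)
    p_value: float  # p-value attached to the estimate
    robust: bool    # does the sign survive an alternative specification?


Rule = Callable[[Estimate], bool]


def negative(e: Estimate) -> bool:
    # Assumed hypothesis for this sketch: the effect is negative.
    return e.coef < 0


def significant_at(alpha: float) -> Rule:
    # Build a significance rule with a chosen threshold.
    return lambda e: e.p_value < alpha


def is_robust(e: Estimate) -> bool:
    return e.robust


def support_share(estimates: Sequence[Estimate], rules: Sequence[Rule]) -> float:
    # Share of estimates that pass every rule in the verdict procedure.
    supported = [e for e in estimates if all(rule(e) for rule in rules)]
    return len(supported) / len(estimates)


# Made-up estimates; the design layer (this list) never changes below.
ESTIMATES = [
    Estimate(-0.12, 0.01, True),
    Estimate(-0.08, 0.20, True),
    Estimate(-0.05, 0.40, False),
    Estimate(-0.10, 0.03, False),
    Estimate(0.02, 0.70, True),
    Estimate(-0.03, 0.55, True),
    Estimate(-0.09, 0.04, True),
    Estimate(-0.01, 0.90, False),
    Estimate(0.04, 0.30, False),
    Estimate(-0.06, 0.15, True),
]

FULL_RULES = [negative, significant_at(0.05), is_robust]   # all checks applied
EASED_RULES = [negative, significant_at(0.25), is_robust]  # threshold loosened
OMITTED_RULES = [negative]                                 # checks dropped

if __name__ == "__main__":
    for name, rules in [("full", FULL_RULES),
                        ("eased", EASED_RULES),
                        ("omitted", OMITTED_RULES)]:
        print(f"{name:8s} support share: {support_share(ESTIMATES, rules):.0%}")
```

In this toy data, easing the significance threshold moves the support share only modestly, while dropping the significance and robustness checks altogether moves it much further, even though not a single coefficient has changed.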

Why should we care? As AI becomes increasingly intertwined with scientific research, its biases could shape conclusions. If an AI can be nudged to change its stance with a simple prompt, what does that mean for scientific integrity? What matters is not just what the AI calculates, but how it interprets those calculations.

The Future of Scientific Analysis

AI's potential in research is undeniable, but its flaws can't be ignored. The interpretation layer is where careful oversight is necessary. As AI continues to evolve, will it become a tool that merely mimics human biases, or will it transcend them? One thing's certain: the balance between the design and verdict layers will be essential.