AI in Science Still Needs Human Support
Despite bold claims, AI's role in scientific discovery remains limited. A recent study reveals where AI excels and where it falters.
Artificial Intelligence has been hailed as the harbinger of a new era in scientific discovery. But are these claims racing ahead of reality? A recent massive evaluation involving 121,640 authors from various scientific fields sheds light on what AI can and can't do in scientific research.
The Experiment
Scientists from biology, medicine, chemistry, and the social sciences were invited to evaluate follow-up ideas that large language models (LLMs) generated from their own research papers. Of those invited, 6,749 experts returned 25,139 sets of ratings, assessing each idea on novelty, empirical feasibility, probability, and favorability.
Three patterns emerged, and the first concerns the diversity of ideas. Non-reasoning LLMs tend to converge on similar ideas, forming a sort of 'hivemind', whereas reasoning models explore a broader hypothesis space. Yet neither kind of model proposes null hypotheses the way human researchers do, which points to a significant limit on AI's creative capacity.
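One rough way to picture the 'hivemind' effect is to measure how much a batch of generated ideas overlaps. The sketch below is illustrative only and is not the study's method: it assumes word-level Jaccard similarity as a crude stand-in for idea similarity, and the function names and example ideas are made up for the demonstration.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (1.0 = identical, 0.0 = disjoint)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_similarity(ideas: list) -> float:
    """Average word-level Jaccard similarity over all pairs of ideas.

    A higher value means the ideas cluster together; a lower value means
    the batch covers a wider range of wording (a rough diversity proxy).
    """
    token_sets = [set(idea.lower().split()) for idea in ideas]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        raise ValueError("need at least two ideas to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical example ideas, invented for illustration (not from the study).
clustered = [
    "test the drug on a larger cohort",
    "test the drug on a younger cohort",
    "test the drug on a rural cohort",
]
varied = [
    "test the drug on a larger cohort",
    "model the binding site with simulations",
    "survey patients about side effects",
]

print("clustered:", mean_pairwise_similarity(clustered))
print("varied:   ", mean_pairwise_similarity(varied))
```

In practice, word overlap misses paraphrases, so a real analysis would more likely compare ideas with some semantic representation; the point here is only the shape of the comparison.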
Human Biases and AI Limitations
The findings also reveal a human bias. Scientists tend to favor ideas that mirror their own, valuing probability over novelty. Social scientists are more open to risk than life scientists, but senior social scientists remain the harshest critics, especially when AI stumbles in complex fields that require nuanced interpretation.
In this context, the skepticism of social scientists seems justified. AI models struggle most where context and evolving theories are key, as they are in the social sciences.
Weak Agreement with Experts
Another significant finding is the weak alignment between automated evaluations and expert judgment. Today's automated evaluators, such as LLM-as-a-judge setups and other automated metrics, align only marginally with expert opinions. The gap persists even with state-of-the-art (SOTA) models.
However, a new approach shows promise: a reward model built by post-training Qwen3-14B on human ratings. It captures the subtleties of different fields, outperforming existing SOTA models by up to 27% and approaching the reliability of independent peer reviews.
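Agreement between an automated evaluator and human experts is often summarized with a rank correlation. The sketch below is a minimal illustration of that idea, assuming Spearman's rho as the agreement measure; the article does not say which statistic the study used, and the scores shown are invented for the example.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five ideas (invented, not data from the study).
expert_scores = [4, 2, 5, 1, 3]
judge_scores = [3, 4, 5, 2, 1]

print("Spearman rho:", spearman(expert_scores, judge_scores))
```

A value near 1 would mean the automated judge orders ideas much as the experts do, while a value near 0 would correspond to the kind of weak alignment described above.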
The Human Element
So, what does this mean for the future of AI in science? Despite the hype, current AI models are collaborators that need human grounding. They are not ready to replace human intuition and creativity, and they still depend on human imagination to reach their full potential.
What happens when AI can finally propose null hypotheses and grasp complex theories independently? Until then, the landscape has shifted, but the human touch remains key in science.
Key Terms Explained
Artificial Intelligence: The science of creating machines that can perform tasks requiring human-like intelligence, such as reasoning, learning, perception, language understanding, and decision-making.
Bias: In AI, bias has two meanings.
Evaluation: The process of measuring how well an AI model performs on its intended task.
Grounding: Connecting an AI model's outputs to verified, factual information sources.