AI in Science Still Needs Human Support

Artificial intelligence has been hailed as the harbinger of a new era in scientific discovery. But are these claims racing ahead of reality? A recent large-scale evaluation involving 121,640 authors from various scientific fields sheds light on what AI can and can't do in scientific research.

The Experiment

Scientists from biology, medicine, chemistry, and the social sciences were invited to evaluate follow-up ideas that large language models (LLMs) generated from their own research papers. Of those invited, 6,749 experts responded, providing 25,139 sets of ratings. They rated the ideas on novelty, empirical feasibility, probability, and favorability.

Three patterns emerged. First, non-reasoning LLMs tend to converge on similar ideas, forming a sort of 'hivemind'. Second, reasoning models explore a broader hypothesis space. Third, neither kind of model proposes null hypotheses the way human researchers do, which points to a significant limitation in AI's creative capacity.
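
One rough way to picture the 'hivemind' effect is to measure how similar a model's ideas are to one another. The sketch below is only an illustration, not the method used in the evaluation: it assumes ideas are plain strings, uses a simple bag-of-words cosine similarity, and the function names `cosine` and `mean_pairwise_similarity` are hypothetical. A higher score means the ideas cluster more tightly.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(ideas: list[str]) -> float:
    """Average similarity over all pairs of ideas (illustrative metric).

    Assumption: whitespace tokenization on lowercased text is a
    stand-in for whatever text representation a real study would use.
    """
    bags = [Counter(idea.lower().split()) for idea in ideas]
    pairs = list(combinations(bags, 2))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

Under these assumptions, a set of near-identical ideas scores close to 1 and a set with no shared words scores 0, so comparing the score across models gives a crude sense of which one explores a wider space.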

Human Biases and AI Limitations

The findings also reveal a human bias. Scientists tend to favor ideas that mirror their own, valuing probability over novelty. Interestingly, social scientists are more open to risk than life scientists. But senior social scientists remain the harshest critics, especially when AI stumbles in complex fields that require nuanced interpretation.

In this context, the skepticism of social scientists seems justified. AI models struggle most where context and evolving theories are key, as they are in the social sciences.

Weak Agreement with Experts

Another significant finding is the weak alignment between AI-driven evaluations and expert judgment. Today's automated evaluators, such as LLM-as-a-judge and other automated metrics, align only weakly with expert opinions. Even with state-of-the-art (SOTA) models, the gap persists.
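
Agreement of this kind is often summarized with a rank correlation between the scores an automated judge assigns and the scores experts assign to the same ideas. The sketch below computes Spearman's rank correlation using only the standard library. It is an assumption that this resembles the comparison made in the evaluation, which may have used a different statistic, and the function names are hypothetical.

```python
import math


def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def spearman(judge_scores: list[float], expert_scores: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(judge_scores) != len(expert_scores) or not judge_scores:
        raise ValueError("score lists must be non-empty and equal in length")
    return pearson(average_ranks(judge_scores), average_ranks(expert_scores))
```

A value near 1 would mean the judge orders ideas much as the experts do, while a value near 0 corresponds to the weak alignment described above.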

However, a new approach shows promise: a reward model built on Qwen3-14B and post-trained on human ratings. It captures the subtleties of different fields, outperforming existing SOTA models by up to 27% and approaching the reliability of independent peer reviews.

The Human Element

So, what does this mean for the future of AI in science? Despite the hype, current AI models are collaborators that need human grounding. They're not ready to replace human intuition and creativity, and it's evident that AI still requires human imagination to reach its full potential.

What happens when AI can finally propose null hypotheses and grasp complex theories independently? Until then, the landscape has shifted, but the human touch remains key in science.