Can AI Really Innovate in Science? New Benchmarks Put LLMs to the Test
A new benchmark framework evaluates AI's ability to innovate in scientific discovery. The results show improvement but raise questions about true creativity.
Scientific discovery has always been a playground of creativity and unpredictability. It's more than just recalling facts or knowledge. Many benchmarks have aimed to evaluate large language models (LLMs) on deep research tasks through multi-hop retrieval. But the true test of their innovative reasoning, essential for genuine scientific discovery, remains elusive.
Introducing a Fresh Benchmark
We're now seeing a new benchmark framework designed to measure how well these models can perform in scientific discovery and reasoning. It builds from a raw problem right up to the classical null hypothesis test. In this setup, models start with just the topic and research question from a recent paper. As more technical details are gradually revealed, the model must generate hypotheses addressing the research question. These are then matched against the original paper's conclusions using automated semantic similarity checks.
Because the evaluation runs from minimal information to full experimental details, it tests both a model's innovativeness and its grounded reasoning at every stage. It's about time we scrutinized these AI systems beyond surface-level achievements, don't you think?
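To make the progressive setup concrete, here is a minimal Python sketch of how such a loop could work. It is not the benchmark's actual code: the function names, the 0.5 threshold, and the toy model are all hypothetical, and a simple word-overlap score stands in for whatever semantic similarity check the framework really uses.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for a semantic similarity model: token-set overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def progressive_eval(
    generate: Callable[[str], List[str]],
    stages: List[str],
    conclusions: List[str],
    threshold: float = 0.5,  # assumed cutoff, not taken from the benchmark
) -> List[float]:
    """Reveal context stage by stage and return, for each stage, the
    fraction of the paper's conclusions matched by some hypothesis."""
    context = ""
    matched_fraction = []
    for stage in stages:
        # Each stage adds detail on top of everything revealed so far.
        context = (context + "\n" + stage).strip()
        hypotheses = generate(context)
        matched = sum(
            1
            for c in conclusions
            if any(jaccard(h, c) >= threshold for h in hypotheses)
        )
        matched_fraction.append(matched / len(conclusions) if conclusions else 0.0)
    return matched_fraction


def fake_model(context: str) -> List[str]:
    """Hypothetical model: only lands on the right idea once it sees a hint."""
    if "porosity" in context:
        return ["higher porosity improves cell adhesion"]
    return ["the coating changes surface roughness"]


# Invented example data, for illustration only.
stages = [
    "Topic: bioactive coatings. Question: what drives cell adhesion?",
    "Detail: porosity was varied across samples.",
]
conclusions = ["higher porosity improves cell adhesion"]

scores = progressive_eval(fake_model, stages, conclusions)
print(scores)
```

In this toy run the fake model misses the conclusion with only the topic and question, then matches it once the extra detail is revealed. A real evaluation would swap in an actual LLM call and a proper semantic similarity model.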
Testing LLMs on Real Science
Specifically, this framework has been applied to evaluate models including GPT-5, GPT-5.4, Gemini 2.5 Pro, and Gemini 3.1 Pro across 45 papers covering bioactive materials, mechanical materials, and nanomaterials. Unsurprisingly, GPT-5.4 and Gemini 3.1 Pro outperformed their predecessors, with GPT-5.4 maintaining an F1 score of 0.7 against the ground-truth conclusions even with minimal context.
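For readers unfamiliar with the metric, the sketch below shows one way an F1 score could be derived from hypothesis-to-conclusion matches. The source does not say exactly how the benchmark computes it, so the definitions here are assumptions: precision is the share of generated hypotheses that match any conclusion, recall is the share of conclusions matched by any hypothesis, and the match table is invented for illustration.

```python
from typing import List


def f1_score(match: List[List[bool]]) -> float:
    """match[i][j] is True when hypothesis i was judged equivalent
    to ground-truth conclusion j."""
    n_hyp = len(match)
    n_concl = len(match[0]) if match else 0
    if n_hyp == 0 or n_concl == 0:
        return 0.0
    # Hypotheses that hit at least one conclusion.
    hyp_hits = sum(1 for row in match if any(row))
    # Conclusions covered by at least one hypothesis.
    concl_hits = sum(1 for j in range(n_concl) if any(row[j] for row in match))
    precision = hyp_hits / n_hyp
    recall = concl_hits / n_concl
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Invented example: 3 hypotheses, 2 conclusions.
example = [
    [True, False],
    [False, False],
    [False, True],
]
result = f1_score(example)
print(round(result, 3))
```

Here two of three hypotheses match something (precision of about 0.67) and both conclusions are covered (recall of 1.0), giving an F1 of 0.8. A model that floods the evaluator with guesses would raise recall but pay for it in precision.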
But let's not celebrate just yet. While these results are promising, they also highlight a critical gap: Are these models truly innovating or just better at mimicking known conclusions? The burden of proof sits with the team, not the community.
The Real Test: Creativity or Computation?
Here's the crux: Can AI really innovate in scientific discovery, or are we merely polishing its computational skills? If the goal is to develop next-gen AI systems capable of true scientific reasoning, we need more than just incremental improvements. We need a fundamental shift in how these models approach creative thinking. Is the AI industry ready for that leap?
Skepticism isn't pessimism. It's due diligence. We must push for transparency in how these benchmarks are set and whether they truly reflect the creative capabilities they claim to measure. Let's apply the standard the industry set for itself. Show me the audit.