Virtual Cells: Overpromised and Underperforming

Large language models (LLMs) have been hailed as the future of cellular simulation, promising to predict how gene expression responds to perturbations. Strip away the marketing, though, and you get a tool that's not ready for prime time: these models produce biologically plausible narratives but struggle to make accurate predictions.

Beyond the Hype

LLMs are touted for their ability to simulate cellular environments, acting as 'virtual cells' that predict gene expression under various conditions. The aim is to stand in for perturbation experiments, which are costly and sparse yet remain a cornerstone of understanding cellular mechanisms. In practice, these models often overestimate differential expression and are frequently outperformed by a simple gene-frequency baseline.

Here's what the benchmarks actually show: in many cases, model performance collapses to chance when evaluated at the per-gene level. Why does this happen? Largely because the models lean on each gene's intrinsic tendency to respond rather than reasoning about the specific perturbation.
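To make the per-gene collapse concrete, here is a minimal sketch in Python. It assumes a gene-frequency baseline means scoring each gene by how often it was differentially expressed in training data, whatever the perturbation; that reading, the helper names, and the toy records are illustrative assumptions, not the benchmark's actual code or data.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC as the chance that a random positive outranks a random negative.

    Ties count as half a win. Returns None when only one class is present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gene_frequency_baseline(train):
    """Assumed baseline: score = fraction of training perturbations in which
    the gene was differentially expressed. The perturbation is ignored."""
    counts = defaultdict(lambda: [0, 0])  # gene -> [times changed, times seen]
    for gene, _pert, label in train:
        counts[gene][0] += label
        counts[gene][1] += 1
    return {g: changed / seen for g, (changed, seen) in counts.items()}


def macro_per_gene_auroc(records, score_of):
    """Compute AUROC separately for each gene, then average across genes."""
    by_gene = defaultdict(list)
    for gene, pert, label in records:
        by_gene[gene].append((label, score_of(gene, pert)))
    aucs = []
    for pairs in by_gene.values():
        a = auroc([y for y, _ in pairs], [s for _, s in pairs])
        if a is not None:  # skip genes with a single class
            aucs.append(a)
    return sum(aucs) / len(aucs)


# Toy records (gene, perturbation, changed?). All names are made up.
train = [
    ("G_hot", "p1", 1), ("G_hot", "p2", 1), ("G_hot", "p3", 1), ("G_hot", "p4", 0),
    ("G_cold", "p1", 0), ("G_cold", "p2", 0), ("G_cold", "p3", 0), ("G_cold", "p4", 1),
]
test = [
    ("G_hot", "p5", 1), ("G_hot", "p6", 1), ("G_hot", "p7", 0),
    ("G_cold", "p5", 0), ("G_cold", "p6", 0), ("G_cold", "p7", 1),
]

freq = gene_frequency_baseline(train)


def score_of(gene, _pert):
    return freq[gene]


# Pooled: all (gene, perturbation) pairs ranked together.
pooled = auroc([y for _, _, y in test], [score_of(g, p) for g, p, _ in test])
# Per gene: the baseline gives one constant score per gene, so it cannot rank.
macro = macro_per_gene_auroc(test, score_of)

print(round(pooled, 3))  # 0.667
print(macro)             # 0.5
```

The pooled score looks respectable only because one gene responds more often than the other. Within a single gene the baseline has nothing to say about which perturbation matters, so the per-gene average sits exactly at chance.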

Introducing CORE

Enter CORE, or Contrastive Organization of Relational Evidence. The approach reframes prediction as a comparison task: by organizing evidence into positive and negative outcomes, CORE distinguishes the effects of related perturbations on the same gene.

Using a biomedical knowledge graph for evidence retrieval, CORE boosts the calibration and accuracy of LLM-based predictions. On drug-perturbation data, CORE-Reasoning improved Qwen3.5-9B's metrics by up to 28.6%. On generic perturbation data, CORE-Voting increased macro-per-gene AUROC from chance levels to an average of 0.703 across four cell lines.
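As a rough illustration of the contrastive idea, the sketch below groups retrieved evidence for one gene into positive and negative sets, then takes a similarity-weighted vote. Everything in it is an assumption made for illustration: the Evidence record, the Jaccard similarity over knowledge-graph neighbors, the voting rule, and the toy drug and gene labels. The article does not describe how CORE-Voting is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One retrieved observation (hypothetical record layout)."""
    perturbation: str
    gene: str
    changed: bool  # True if the gene was differentially expressed


def jaccard(a, b):
    """Set overlap in [0, 1]; a stand-in for knowledge-graph relatedness."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def organize(evidence, gene):
    """Split evidence about one gene into positive and negative outcomes."""
    same_gene = [e for e in evidence if e.gene == gene]
    positives = [e for e in same_gene if e.changed]
    negatives = [e for e in same_gene if not e.changed]
    return positives, negatives


def vote(query, gene, evidence, neighbors):
    """Similarity-weighted share of positive evidence for (query, gene).

    Returns 0.5 when no related evidence exists (assumed fallback).
    """
    positives, negatives = organize(evidence, gene)

    def weight(e):
        return jaccard(neighbors[query], neighbors[e.perturbation])

    pos = sum(weight(e) for e in positives)
    neg = sum(weight(e) for e in negatives)
    total = pos + neg
    return pos / total if total else 0.5


# Toy knowledge-graph neighborhoods. Labels are placeholders, not findings.
neighbors = {
    "drugA": {"EGFR", "MAPK1", "KRAS"},
    "drugB": {"EGFR", "MAPK1"},
    "drugC": {"TP53", "MDM2", "KRAS"},
    "drugQ": {"EGFR", "KRAS"},   # query that resembles drugA and drugB
    "drugR": {"TP53", "MDM2"},   # query that resembles drugC
}
evidence = [
    Evidence("drugA", "FOS", True),
    Evidence("drugB", "FOS", True),
    Evidence("drugC", "FOS", False),
    Evidence("drugC", "MYC", True),  # different gene, ignored for FOS
]

score_q = vote("drugQ", "FOS", evidence, neighbors)
score_r = vote("drugR", "FOS", evidence, neighbors)
print(round(score_q, 3))  # 0.8
print(score_r)            # 0.0
```

The point of the toy example is that the same gene gets different scores for different perturbations, which a per-gene constant cannot do. Whether the real method weighs evidence this way is not stated in the source.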

Why This Matters

So why should we care? The architecture matters more than the parameter count, and CORE's contrastive evidence organization is proving essential for reliable perturbation reasoning. It's a step toward making LLMs a practical tool for cellular simulations. But is it enough?

Not yet. Despite the improvements, LLM-based simulations are still far from perfect: a macro-per-gene AUROC of 0.703 is clearly better than chance, but it leaves plenty of room for error. If we're to rely on these models in critical biomedical research, their accuracy needs to be more than a technological promise. Whether CORE's approach marks a major shift is the question researchers and industry leaders need to ask.
