Virtual Cells: Overpromised and Underperforming
Large Language Models (LLMs) are being used to simulate cellular responses. Yet, their predictions fall short. A new method, CORE, offers a promising fix.
Large Language Models (LLMs) have been hailed as the future of cellular simulation, promising insights into gene expression responses. Strip away the marketing, though, and the tool is not ready for prime time: the models produce biologically plausible narratives but struggle to make accurate predictions.
Beyond the Hype
LLMs are touted for their ability to simulate cellular environments, acting as 'virtual cells' that predict gene expression under various conditions. The goal is to stand in for perturbation experiments, which are a cornerstone of understanding cellular mechanisms but are costly and yield sparse data. In practice, these models often overestimate differential expression and are frequently outperformed by a simple gene-frequency baseline.
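To make the comparison concrete, here is a minimal sketch of what a gene-frequency baseline could look like. The article does not give its exact definition, so this assumes the simplest reading: score each gene by how often it was differentially expressed in training data, ignoring the perturbation entirely. The function names and the toy records are hypothetical.

```python
from collections import defaultdict

def fit_gene_frequency(train):
    """Learn, per gene, the fraction of training perturbations that changed it.

    train: iterable of (perturbation, gene, label), label 1 = differentially expressed.
    The perturbation is deliberately ignored (assumed baseline definition).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for _perturbation, gene, label in train:
        hits[gene] += label
        totals[gene] += 1
    return {gene: hits[gene] / totals[gene] for gene in totals}

def predict(freq, gene, default=0.0):
    """Score a (perturbation, gene) query using only the gene's historical rate."""
    return freq.get(gene, default)

# Hypothetical toy records, for illustration only.
train = [
    ("drugA", "TP53", 1), ("drugB", "TP53", 1), ("drugC", "TP53", 0),
    ("drugA", "GAPDH", 0), ("drugB", "GAPDH", 0), ("drugC", "GAPDH", 0),
]
freq = fit_gene_frequency(train)
```

A baseline like this performs no perturbation reasoning at all, which is why losing to it is a warning sign.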
Here's what the benchmarks actually show: in many cases, the performance of these models collapses to chance when evaluated at the per-gene level. Why does this happen? Largely because the models lean on each gene's intrinsic tendency to respond, that is, how often it changes under any perturbation, rather than reasoning about the specific perturbation at hand.
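The collapse is easy to reproduce on toy data. The sketch below assumes that macro-per-gene AUROC means computing AUROC separately for each gene and averaging; the helper names and numbers are made up for illustration. Scores that depend only on the gene look respectable when all predictions are pooled, yet sit at exactly chance within each gene.

```python
from collections import defaultdict

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when one class is missing
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_per_gene_auroc(rows):
    """rows: (gene, label, score). Average AUROC over genes that have both classes."""
    by_gene = defaultdict(list)
    for gene, label, score in rows:
        by_gene[gene].append((label, score))
    values = []
    for pairs in by_gene.values():
        value = auroc([y for y, _ in pairs], [s for _, s in pairs])
        if value is not None:
            values.append(value)
    return sum(values) / len(values)

# Toy data: each score depends only on the gene, never on the perturbation.
rows = [
    ("G1", 1, 0.75), ("G1", 1, 0.75), ("G1", 1, 0.75), ("G1", 0, 0.75),
    ("G2", 0, 0.25), ("G2", 0, 0.25), ("G2", 0, 0.25), ("G2", 1, 0.25),
]
pooled = auroc([y for _, y, _ in rows], [s for _, _, s in rows])
macro = macro_per_gene_auroc(rows)
print(pooled, macro)  # 0.75 0.5
```

The pooled score rewards knowing which genes respond often; the per-gene score only rewards telling perturbations apart for the same gene.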
Introducing CORE
Enter CORE, or Contrastive Organization of Relational Evidence. This approach reimagines prediction as a comparison task. By organizing evidence into positive and negative outcomes, CORE differentiates the effects of related perturbations on the same gene. This makes a world of difference.
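As a rough illustration of the idea, retrieved evidence about one gene can be split into perturbations that changed it and perturbations that did not, then presented side by side. The article does not describe CORE's actual data structures or prompt format, so everything below (the `Evidence` record, the function names, the prompt wording, the example perturbations) is a hypothetical sketch rather than the authors' implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    perturbation: str  # a related perturbation retrieved as evidence
    gene: str          # gene whose response was observed
    changed: bool      # True if the gene was differentially expressed

def organize_contrastive(evidence, gene):
    """Group evidence about one gene into positive and negative outcomes."""
    same_gene = [e for e in evidence if e.gene == gene]
    return {
        "positive": [e.perturbation for e in same_gene if e.changed],
        "negative": [e.perturbation for e in same_gene if not e.changed],
    }

def render_prompt(query_perturbation, gene, groups):
    """Lay the two groups out side by side so the model has to compare them."""
    positive = ", ".join(groups["positive"]) or "none"
    negative = ", ".join(groups["negative"]) or "none"
    return "\n".join([
        f"Query: does {query_perturbation} change expression of {gene}?",
        f"Related perturbations that DID change {gene}: {positive}",
        f"Related perturbations that did NOT change {gene}: {negative}",
    ])

# Hypothetical retrieved evidence, for illustration only.
evidence = [
    Evidence("drugA", "TP53", True),
    Evidence("drugB", "TP53", False),
    Evidence("drugC", "TP53", True),
    Evidence("drugA", "MYC", True),
]
groups = organize_contrastive(evidence, "TP53")
prompt = render_prompt("drugD", "TP53", groups)
```

Because both groups concern the same gene, the gene's general tendency to respond cannot separate them; only the differences between perturbations can.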
Using a biomedical knowledge graph for evidence retrieval, CORE boosts the calibration and accuracy of LLM-based predictions. On drug-perturbation data, CORE-Reasoning improved Qwen3.5-9B's metrics by up to 28.6%. On generic perturbation data, CORE-Voting increased macro-per-gene AUROC from chance levels to an average of 0.703 across four cell lines.
Why This Matters
So why should we care? The architecture matters more than the parameter count, and CORE's contrastive evidence organization is proving essential for reliable perturbation reasoning. It's a step toward making LLMs a practical tool for cellular simulations. But is it enough?
The numbers suggest not yet. A macro-per-gene AUROC of 0.703 is well above chance, but it leaves LLM-based simulations far from perfect. If we're to rely on these models in critical biomedical research, their accuracy needs to be more than a technological promise. Will CORE's approach be the shift the field needs? That's the question researchers and industry leaders need to ask.
Key Terms Explained
Knowledge graph: A structured representation of information as a network of entities and their relationships.
LLM: Large Language Model.
Parameter: A value the model learns during training, specifically the weights and biases in neural network layers.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.