Decoding the Chemistry of Language Models
A benchmark of large language models on molecular tasks shows how strongly performance depends on the way a molecule is written down. The results challenge the idea of a one-size-fits-all representation.
Large language models (LLMs) are increasingly finding their footing in chemistry. But which molecular representation is the best fit for these models? A recent benchmark study dove into this question, examining the performance of 16 LLMs across nine molecular representations and eight chemical tasks.
The Chemistry of Representation
Different molecular representations carry varying strengths. The study finds that structured text representations like CML and MolJSON excel in structural tasks, while IUPAC takes the crown in semantic tasks, notably molecule retrieval. SMILES variants, despite their popularity in pretraining, rarely hit the optimal mark.
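To make the contrast concrete, here is a minimal Python sketch of one molecule, ethanol, written in several of these forms. The SMILES, InChI, and IUPAC strings are the standard ones for ethanol. The CML fragment is a simplified, hand-written illustration (heavy atoms only) and is not taken from the benchmark; MolJSON is left out because the article does not describe its schema. The helper name `heavy_atom_summary` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Ethanol in several textual forms. The CML fragment is a simplified,
# hand-written illustration with heavy atoms only (an assumption made
# for this example, not data from the study).
ETHANOL = {
    "smiles": "CCO",
    "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "iupac": "ethanol",
    "cml": (
        "<molecule id='ethanol'>"
        "<atomArray>"
        "<atom id='a1' elementType='C'/>"
        "<atom id='a2' elementType='C'/>"
        "<atom id='a3' elementType='O'/>"
        "</atomArray>"
        "<bondArray>"
        "<bond atomRefs2='a1 a2' order='1'/>"
        "<bond atomRefs2='a2 a3' order='1'/>"
        "</bondArray>"
        "</molecule>"
    ),
}


def heavy_atom_summary(cml_text):
    """Return (element symbols, bonded atom-id pairs) from a CML-like string."""
    root = ET.fromstring(cml_text)
    # Every atom is an explicit element with its own id and element type.
    elements = [atom.get("elementType") for atom in root.iter("atom")]
    # Every bond names the two atom ids it connects.
    bonds = [tuple(bond.get("atomRefs2").split()) for bond in root.iter("bond")]
    return elements, bonds


if __name__ == "__main__":
    for name, text in ETHANOL.items():
        print(f"{name:>6}: {len(text):3d} chars")
    print(heavy_atom_summary(ETHANOL["cml"]))
```

The structured form is far longer than the three-character SMILES string, but atoms and bonds can be read off directly instead of being inferred from string grammar, which is one plausible reason structured text helps on structural tasks.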
If you're wondering why SMILES doesn't lead the pack despite its widespread use, the answer is specialization. Chemistry-specialized models perform well with SMILES but stumble when faced with structured text, indicating that a SMILES-only approach might favor specialization over generalization. Is the SMILES approach too narrow for the expansive needs of molecular LLMs?
Evaluating the Models
The benchmark study evaluated 16 LLMs from five model families, including reasoning and non-reasoning models, chemistry-specialized models, and closed frontier models. Despite that diversity, no single representation emerged as a universal champion across all tasks. On aggregate, CML led the way, followed by MolJSON and InChI, with canonical SMILES trailing behind.
Using an LLM-as-a-judge setup, the study revealed another intriguing insight: IUPAC representations produced the highest fraction of correct molecule generations. This is more than a matter of preference. It is an argument for task-aware representation routing, which could redefine how we approach LLM-based chemistry.
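Task-aware routing can be pictured as a small lookup step placed in front of the model. The sketch below is only an illustration in Python: the task categories, the preference table, and the function name `route_representation` are assumptions derived from the findings summarized in this article, not an interface published by the study.

```python
# Hypothetical routing table built from the article's summary:
# structured text for structural tasks, IUPAC for semantic tasks
# and for molecule generation.
TASK_PREFERENCES = {
    "structural": ["CML", "MolJSON"],
    "semantic": ["IUPAC"],
    "generation": ["IUPAC"],
}

# Fallback order follows the aggregate ranking reported in the article.
FALLBACK_ORDER = ["CML", "MolJSON", "InChI", "canonical SMILES"]


def route_representation(task_category, available):
    """Pick the representation to feed the model for a given task.

    `available` is the collection of representation names that exist
    for the molecule at hand.
    """
    candidates = TASK_PREFERENCES.get(task_category, []) + FALLBACK_ORDER
    for name in candidates:
        if name in available:
            return name
    raise ValueError("no supported representation is available")


if __name__ == "__main__":
    have = {"IUPAC", "InChI", "canonical SMILES"}
    print(route_representation("semantic", have))
    print(route_representation("structural", have))
```

In a real pipeline the table would be fitted per model from benchmark results, since the article notes that chemistry-specialized models behave differently from general-purpose ones.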
Rethinking Representation-Invariant Evaluation
Mechanistic studies using tokenization audits, linear probes, and attention analyses showed that different representations are encoded distinctly within the models. The structured representations demand higher attention across the molecular span, raising the question: Can a one-size-fits-all evaluation truly capture the intricacies of molecular LLMs?
As the convergence of AI and chemistry deepens, it becomes clear that representation-invariant evaluation might not cut it. The need for dynamic, task-specific routing is evident, hinting at a future where LLMs are tailored to navigate these complex molecular landscapes more effectively.
Key Terms Explained
Attention: A mechanism that lets neural networks focus on the most relevant parts of their input when producing output.
Benchmark: A standardized test used to measure and compare AI model performance.
Evaluation: The process of measuring how well an AI model performs on its intended task.
LLM: Large Language Model.