Decoding the Chemistry of Language Models

Large language models (LLMs) are increasingly finding their footing in chemistry. But which molecular representation suits these models best? A recent benchmark study examined this question, measuring the performance of 16 LLMs across nine molecular representations and eight chemical tasks.

The Chemistry of Representation

Different molecular representations have different strengths. The study finds that structured text representations like CML and MolJSON excel in structural tasks, while IUPAC takes the crown in semantic tasks, notably molecule retrieval. SMILES variants, despite their popularity in pretraining, rarely come out on top.
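To make the contrast concrete, here is a small Python sketch that writes one molecule, ethanol, in several forms. The SMILES, InChI, and IUPAC strings are standard for ethanol. The atom-and-bond dictionary is a hypothetical stand-in for a structured text representation and should not be read as the actual MolJSON or CML schema, which the article does not describe.

```python
import json

# Ethanol in three of the notations the article mentions.
ETHANOL = {
    "smiles": "CCO",
    "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "iupac": "ethanol",
}

# Hypothetical structured form: atoms and bonds spelled out explicitly.
# Illustration only; this is NOT the real MolJSON or CML schema.
ETHANOL_GRAPH = {
    "atoms": [
        {"id": 0, "element": "C"},
        {"id": 1, "element": "C"},
        {"id": 2, "element": "O"},
    ],
    "bonds": [
        {"from": 0, "to": 1, "order": 1},
        {"from": 1, "to": 2, "order": 1},
    ],
}


def heavy_atom_count(graph: dict) -> int:
    """Count the atoms listed in the graph (hydrogens are left implicit)."""
    return len(graph["atoms"])


def render_structured(graph: dict) -> str:
    """Serialize the graph as the text a model would actually see."""
    return json.dumps(graph, sort_keys=True)
```

The point of the sketch is that the structured form states connectivity explicitly, atom by atom, whereas SMILES packs the same graph into three characters and the IUPAC name carries it as a word.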

If you're wondering why SMILES doesn't lead the pack despite its widespread use, the answer is specialization. Chemistry-specialized models perform well with SMILES but stumble when faced with structured text, indicating that a SMILES-only approach might favor specialization over generalization. Is the SMILES approach too narrow for the expansive needs of molecular LLMs?

Evaluating the Models

The benchmark study evaluated 16 LLMs from five model families, spanning reasoning and non-reasoning models, chemistry-specialized models, and closed frontier models. Despite that diversity, no single representation was the best choice on every task. Overall, CML led the way, followed by MolJSON and InChI, with canonical SMILES trailing behind.

Using an LLM as a judge, the study reported another intriguing result: IUPAC representations produced the highest fraction of correct molecule generations. This is more than a curiosity. It is an argument for task-aware representation routing, meaning the input format is chosen to suit the task instead of being fixed in advance, and that could change how we approach LLM-based chemistry.
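A minimal sketch of what task-aware routing could look like in Python follows. The preferences in the table mirror the findings summarized in this article (structured text for structural tasks, IUPAC for retrieval and generation). The task labels, the function name, and the fallback to SMILES are assumptions made for illustration, not something the study prescribes.

```python
from typing import Dict, Tuple

# Fallback when a task is unknown or the preferred form is unavailable.
# Choosing SMILES here is an assumption, picked because it is widely available.
DEFAULT_REPRESENTATION = "smiles"

# Hypothetical task labels mapped to the representation the article
# reports as strongest for that kind of task.
ROUTING_TABLE: Dict[str, str] = {
    "structural": "cml",
    "retrieval": "iupac",
    "generation": "iupac",
}


def route_representation(task: str, available: Dict[str, str]) -> Tuple[str, str]:
    """Return (representation name, molecule text) to send to the model."""
    preferred = ROUTING_TABLE.get(task, DEFAULT_REPRESENTATION)
    if preferred in available:
        return preferred, available[preferred]
    if DEFAULT_REPRESENTATION in available:
        return DEFAULT_REPRESENTATION, available[DEFAULT_REPRESENTATION]
    raise KeyError(f"no usable representation for task {task!r}")
```

For example, calling route_representation("retrieval", {"smiles": "CCO", "iupac": "ethanol"}) selects the IUPAC name, while a "structural" task with the same inputs falls back to SMILES because no CML string was supplied.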

Rethinking Representation-Invariant Evaluation

Mechanistic studies using tokenization audits, linear probes, and attention analyses showed that different representations are encoded distinctly within the models. The structured representations demand higher attention across the molecular span, raising the question: Can a one-size-fits-all evaluation truly capture the intricacies of molecular LLMs?
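As a rough illustration of a tokenization audit, the Python sketch below counts how many tokens each representation of a molecule costs. The article does not describe the study's actual procedure, so everything here is an assumption: the regex tokenizer is a crude stand-in for a model's real subword tokenizer, and the function names are hypothetical.

```python
import re
from typing import Dict, List

# Crude stand-in tokenizer: runs of letters, runs of digits, or any single
# non-space symbol. Real LLM tokenizers (e.g. BPE) split text differently.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\d+|\S")


def crude_tokenize(text: str) -> List[str]:
    """Split text into tokens with the stand-in pattern above."""
    return TOKEN_PATTERN.findall(text)


def audit(representations: Dict[str, str], heavy_atoms: int) -> Dict[str, Dict[str, float]]:
    """Report token count and tokens per heavy atom for each representation."""
    report = {}
    for name, text in representations.items():
        n_tokens = len(crude_tokenize(text))
        report[name] = {
            "tokens": n_tokens,
            "tokens_per_heavy_atom": n_tokens / heavy_atoms,
        }
    return report
```

With a model's real tokenizer swapped in for crude_tokenize, the same loop would show how much longer the molecular span becomes under a verbose structured format than under a compact string like SMILES.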

The overlap between AI and chemistry keeps growing, and as it does, it becomes clear that representation-invariant evaluation might not cut it. The study's results point to a need for dynamic, task-specific routing, hinting at a future where LLMs are tailored to navigate these complex molecular landscapes more effectively.
