The Essential Choice of Molecular Representation in LLMs
Different molecular representations impact LLM performance across tasks. No one-size-fits-all solution exists, prompting a careful selection based on task needs.
Large language models (LLMs) are taking on new roles in molecular tasks, but one question is pressing: which molecular representation should we use? A recent benchmark study evaluated 16 LLMs from five model families on nine molecular representations, testing them across eight distinct chemistry tasks.
Performance Tied to Representation
Here's what the benchmarks actually show: performance is heavily dependent on the type of representation used. CML (Chemical Markup Language) took the lead, followed by MolJSON and InChI. Canonical SMILES, despite its prevalence, lags behind. Explicit structured text representations like CML and MolJSON excel in structural tasks, while IUPAC shines in semantic tasks, dominating molecule retrieval across all 16 models.
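To make the contrast concrete, here is a minimal sketch that writes one molecule (ethanol) in several of the representations named above. The strings are hand-written for illustration and are not taken from the study; the CML fragment is simplified to heavy atoms only, and MolJSON is left out because its schema is not described here. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

# Ethanol, written by hand in four representations (illustrative only).
ETHANOL = {
    "SMILES": "CCO",
    "InChI": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "IUPAC": "ethanol",
    # Simplified CML: heavy atoms and single bonds, hydrogens omitted.
    "CML": (
        f'<molecule xmlns="{CML_NS}">'
        "<atomArray>"
        '<atom id="a1" elementType="C"/>'
        '<atom id="a2" elementType="C"/>'
        '<atom id="a3" elementType="O"/>'
        "</atomArray>"
        "<bondArray>"
        '<bond atomRefs2="a1 a2" order="1"/>'
        '<bond atomRefs2="a2 a3" order="1"/>'
        "</bondArray>"
        "</molecule>"
    ),
}


def heavy_atoms_from_cml(cml_text):
    """Return the element symbols listed explicitly in a CML fragment."""
    root = ET.fromstring(cml_text)
    ns = {"cml": CML_NS}
    return [atom.get("elementType") for atom in root.findall(".//cml:atom", ns)]


def bonds_from_cml(cml_text):
    """Return (atom_id, atom_id) pairs for each explicit bond in a CML fragment."""
    root = ET.fromstring(cml_text)
    ns = {"cml": CML_NS}
    return [tuple(bond.get("atomRefs2").split()) for bond in root.findall(".//cml:bond", ns)]


if __name__ == "__main__":
    for name, text in ETHANOL.items():
        print(f"{name:7s} {len(text):4d} chars")
    print(heavy_atoms_from_cml(ETHANOL["CML"]))
    print(bonds_from_cml(ETHANOL["CML"]))
```

In the CML fragment every atom and bond is spelled out and can be read back with an ordinary XML parser, whereas SMILES encodes the same connectivity implicitly in three characters. That difference in explicitness is one plausible reading of why structured text helps on structural tasks.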
This brings us to an interesting point. SMILES variants, often favored in pretraining, rarely hit the mark. Chemistry-specialized models do well with SMILES but falter with structured texts, indicating a narrow specialization that stumbles in broader applications. The architecture matters more than the parameter count.
Judging Molecular Generation
When LLMs judge molecular generation accuracy, IUPAC stands out. It's the go-to for correct molecule generation. This raises a question: are we underestimating the importance of semantic clarity in our quest for computational efficiency?
A deep dive into mechanistic details, using tokenization audits, linear probes, and attention analysis, reveals that not all representations are created equal inside the LLMs. Structured representations demand more attention across the molecular span. This isn't just a technical tidbit; it shows that the choice of representation is a non-trivial part of optimizing LLM performance for specific tasks.
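As a rough idea of what a tokenization audit can look like, the sketch below counts how many tokens each representation of ethanol produces. It is not the study's procedure: the regex splitter is a crude stand-in for a real model tokenizer (an assumption made so the example runs without any downloads), and the function names are hypothetical.

```python
import re

# Crude stand-in tokenizer: runs of letters, runs of digits, or any single
# non-space symbol. A real audit would use the tokenizer of the LLM under test.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+|\d+|\S")


def naive_tokens(text):
    """Split text into tokens with the stand-in pattern."""
    return TOKEN_PATTERN.findall(text)


def audit(representations):
    """Map each representation name to its stand-in token count."""
    return {name: len(naive_tokens(text)) for name, text in representations.items()}


# Hand-written ethanol strings, for illustration only.
samples = {
    "SMILES": "CCO",
    "InChI": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "IUPAC": "ethanol",
}

if __name__ == "__main__":
    for name, count in audit(samples).items():
        print(f"{name:7s} {count:3d} tokens")
```

Swapping `naive_tokens` for a model's actual tokenizer would turn this into a real audit. Longer, more fragmented token sequences give the model more positions to attend over, which fits the observation that structured representations spread attention across the molecular span.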
The Takeaway: Choose Wisely
Stripped of marketing gloss, the numbers are clear: a one-size-fits-all representation doesn't exist. Instead, the task at hand should drive the choice. The study argues against a representation-invariant approach, pushing for a task-aware strategy in LLM-based chemistry work.
For anyone in the field, this means carefully considering which representation best suits your specific needs. It's not just about picking a model. It's about aligning your tools with your goals, ensuring that the path from data to insight is as clear and effective as possible.
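A task-aware strategy can be as simple as a lookup made before the prompt is built. The sketch below encodes the article's headline findings as a table; the task labels, the fallback choice, and the function name are hypothetical, and the mapping is a simplification that should be validated against your own model and data.

```python
# Hypothetical task labels mapped to the representation the article reports
# as strongest for that kind of task. A simplification, not the study's code.
PREFERRED_REPRESENTATION = {
    "structural": "CML",    # explicit structured text excels at structural tasks
    "semantic": "IUPAC",    # IUPAC shines in semantic tasks
    "retrieval": "IUPAC",   # IUPAC dominated molecule retrieval
    "generation": "IUPAC",  # IUPAC stood out when judging generation
}

# Assumed fallback: the overall benchmark leader.
DEFAULT_REPRESENTATION = "CML"


def pick_representation(task_kind):
    """Return the suggested representation for a task label (case-insensitive)."""
    return PREFERRED_REPRESENTATION.get(task_kind.strip().lower(), DEFAULT_REPRESENTATION)


if __name__ == "__main__":
    for task in ("structural", "Retrieval", "property prediction"):
        print(task, "->", pick_representation(task))
```

Keeping the mapping in one place makes it easy to revise as new benchmark results arrive, rather than hard-coding SMILES throughout a pipeline.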
Key Terms Explained
Attention: A mechanism that lets neural networks focus on the most relevant parts of their input when producing output.
Benchmark: A standardized test used to measure and compare AI model performance.
LLM: Large Language Model.
Parameter: A value the model learns during training, specifically the weights and biases in neural network layers.