The Essential Choice of Molecular Representation in LLMs

Large language models (LLMs) are taking on new roles in molecular tasks, which makes one question pressing: which molecular representation should we use? A recent benchmark study evaluated 16 LLMs from five model families on nine molecular representations, testing them on eight distinct chemistry tasks.

Performance Tied to Representation

Here's what the benchmarks actually show: performance is heavily dependent on the type of representation used. CML (Chemical Markup Language) took the lead, followed by MolJSON and InChI. Canonical SMILES, despite its prevalence, lags behind. Explicit structured text representations like CML and MolJSON excel in structural tasks, while IUPAC shines in semantic tasks, dominating molecule retrieval across all 16 models.
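To make the contrast between these formats concrete, here is a minimal Python sketch that writes one small molecule, ethanol, in several of the notations mentioned above. The CML fragment follows common CML conventions (atomArray, bondArray, elementType, atomRefs2) but is simplified. The JSON layout is a hypothetical stand-in, since the MolJSON schema is not described here, so its field names are assumptions. This is an illustration only, not the study's code.

```python
import json
import xml.etree.ElementTree as ET

# Ethanol as a heavy-atom graph: (atom id, element, attached hydrogens).
ATOMS = [("a1", "C", 3), ("a2", "C", 2), ("a3", "O", 1)]
# Bonds as (begin atom id, end atom id, bond order).
BONDS = [("a1", "a2", 1), ("a2", "a3", 1)]


def to_cml(atoms, bonds):
    """Serialize the graph as a simplified CML-style XML fragment."""
    mol = ET.Element("molecule", id="ethanol")
    atom_array = ET.SubElement(mol, "atomArray")
    for atom_id, element, h_count in atoms:
        ET.SubElement(atom_array, "atom", id=atom_id,
                      elementType=element, hydrogenCount=str(h_count))
    bond_array = ET.SubElement(mol, "bondArray")
    for begin, end, order in bonds:
        ET.SubElement(bond_array, "bond",
                      atomRefs2=f"{begin} {end}", order=str(order))
    return ET.tostring(mol, encoding="unicode")


def to_json_graph(atoms, bonds):
    """Serialize the graph as JSON. Field names are hypothetical, not MolJSON's."""
    return json.dumps({
        "atoms": [{"id": a, "element": e, "hydrogens": h} for a, e, h in atoms],
        "bonds": [{"begin": b, "end": e, "order": o} for b, e, o in bonds],
    })


# The same molecule in five notations; the first three are written by hand.
REPRESENTATIONS = {
    "smiles": "CCO",
    "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "iupac": "ethanol",
    "cml": to_cml(ATOMS, BONDS),
    "json_graph": to_json_graph(ATOMS, BONDS),
}

for name, text in REPRESENTATIONS.items():
    print(f"{name}: {text}")
```

The structured formats spell out every atom and bond explicitly, whereas SMILES and InChI compress the same graph into a short string that the model has to decode implicitly.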

One result stands out. SMILES variants, though often favored in pretraining, rarely come out on top. Chemistry-specialized models do well with SMILES but falter with structured text, which points to a narrow specialization that stumbles in broader applications. The architecture matters more than the parameter count.

Judging Molecular Generation

When LLMs judge molecular generation accuracy, IUPAC stands out. It's the go-to for correct molecule generation. This raises a question: are we underestimating the importance of semantic clarity in our quest for computational efficiency?

A deep dive into mechanistic details, using tokenization audits, linear probes, and attention analysis, reveals that not all representations are created equal inside the LLMs. Structured representations demand more attention across the molecular span. This isn't just a technical tidbit: it shows that the choice of representation is a non-trivial part of optimizing LLM performance for a specific task.
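A tokenization audit of this kind can be sketched in a few lines. The version below is a rough illustration under loud assumptions: it uses a crude regex splitter as a stand-in for a real model tokenizer, and a hand-written ethanol example rather than the study's data. A real audit would use each model's own tokenizer, so treat the counts as showing only the shape of the comparison.

```python
import re

# Ethanol in four notations, written by hand for illustration.
SAMPLES = {
    "smiles": "CCO",
    "iupac": "ethanol",
    "inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "cml": (
        '<molecule><atomArray>'
        '<atom id="a1" elementType="C"/>'
        '<atom id="a2" elementType="C"/>'
        '<atom id="a3" elementType="O"/>'
        '</atomArray><bondArray>'
        '<bond atomRefs2="a1 a2" order="1"/>'
        '<bond atomRefs2="a2 a3" order="1"/>'
        '</bondArray></molecule>'
    ),
}
HEAVY_ATOMS = 3  # two carbons and one oxygen


def crude_tokens(text):
    """Stand-in tokenizer (an assumption): alphanumeric runs or single symbols."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", text)


def audit(samples, heavy_atoms):
    """Count crude tokens per representation, normalized by molecule size."""
    report = {}
    for name, text in samples.items():
        n_tokens = len(crude_tokens(text))
        report[name] = {
            "tokens": n_tokens,
            "tokens_per_heavy_atom": n_tokens / heavy_atoms,
        }
    return report


REPORT = audit(SAMPLES, HEAVY_ATOMS)
for name, row in sorted(REPORT.items(), key=lambda kv: kv[1]["tokens"]):
    print(f"{name:7s} {row['tokens']:4d} tokens  "
          f"{row['tokens_per_heavy_atom']:.1f} per heavy atom")
```

Even with this crude splitter, the structured format spreads the molecule over many more tokens than the compact strings. That is one plausible reason a model would need to attend over a longer span.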

The Takeaway: Choose Wisely

Stripped of marketing gloss, the numbers say that no one-size-fits-all representation exists. Instead, the task at hand should drive the choice. The study argues against a representation-invariant approach and pushes for a task-aware strategy in LLM-based chemistry work.
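In practice, a task-aware strategy can start as nothing more than a lookup table. The sketch below is a hypothetical helper, not something from the study. The task labels are assumed names, and the preferences paraphrase the findings summarized above: structured text for structural tasks, IUPAC for semantic tasks, retrieval, and generation.

```python
# Hypothetical task labels mapped to representations in order of preference.
# The ordering paraphrases this article's summary, not the study's full results.
PREFERRED = {
    "structural": ["CML", "MolJSON"],
    "semantic": ["IUPAC"],
    "retrieval": ["IUPAC"],
    "generation": ["IUPAC"],
}


def choose_representation(task, available):
    """Return the first preferred representation for `task` that is available."""
    if task not in PREFERRED:
        # No silent default: an unknown task should force an explicit decision.
        raise ValueError(f"no preference recorded for task {task!r}")
    for representation in PREFERRED[task]:
        if representation in available:
            return representation
    raise LookupError(f"none of {PREFERRED[task]} is available for {task!r}")


# Example: a pipeline that can emit SMILES, MolJSON, and IUPAC names.
available = {"SMILES", "MolJSON", "IUPAC"}
print(choose_representation("structural", available))
print(choose_representation("retrieval", available))
```

Raising an error for an unknown task, instead of falling back to a default such as SMILES, keeps the helper from quietly reintroducing the one-size-fits-all choice.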

For anyone in the field, this means carefully considering which representation best suits your specific needs. It's not just about picking a model. It's about aligning your tools with your goals, ensuring that the path from data to insight is as clear and effective as possible.
