Sniffing Out AI's Limits: The Olfactory Perception Benchmark
LLMs are now being tested on their ability to reason about smells. The Olfactory Perception benchmark shows they do better with compound names than with molecular structures.
JUST IN: AI's capabilities are expanding into areas you wouldn't expect, like understanding smells. The Olfactory Perception (OP) benchmark is here to push the limits of large language models (LLMs) by testing them on a subject more at home in a perfumery than a server room.
What's the OP Benchmark?
Consisting of 1,010 questions across eight categories, the OP benchmark covers odor classification, primary descriptor identification, intensity judgments, and more. That's a wild range of tasks, all aimed at testing whether LLMs can reason like a sommelier or a perfumer.
Questions are presented in two formats: compound names and isomeric SMILES (a line notation that spells out a molecule's atoms, bonds, and stereochemistry as plain text). The results? Compound names are the clear winners, outperforming SMILES prompts by anywhere from 2.4 to nearly 19 percentage points, or about 7 points on average.
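To make the two formats concrete, here is a minimal Python sketch of one multiple-choice question rendered both ways. The prompt template, the answer options, and the names `TEMPLATE` and `render_prompt` are hypothetical; the article doesn't publish the benchmark's actual prompt wording. The molecule is vanillin, which has no stereocenters, so its isomeric SMILES is the same as its ordinary SMILES.

```python
# Hypothetical prompt template: the benchmark's real wording is not given in the article.
TEMPLATE = (
    "Which odor descriptor best fits the molecule {molecule}?\n"
    "Options: {options}\n"
    "Answer with one option."
)


def render_prompt(molecule: str, options: list[str]) -> str:
    """Fill the (hypothetical) template with a molecule string and answer options."""
    return TEMPLATE.format(molecule=molecule, options=", ".join(options))


# Vanillin written two ways: as a compound name and as a SMILES string.
VANILLIN_NAME = "vanillin"
VANILLIN_SMILES = "COC1=C(C=CC(=C1)C=O)O"
options = ["vanilla", "fishy", "citrus", "smoky"]

name_prompt = render_prompt(VANILLIN_NAME, options)
smiles_prompt = render_prompt(VANILLIN_SMILES, options)

print(name_prompt)
print()
print(smiles_prompt)
```

In the name-based prompt, "vanillin" already shares a stem with the matching option "vanilla", while the SMILES prompt offers no such shortcut. That is the kind of lexical cue the results suggest models lean on.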
Why Does This Matter?
The results suggest that current LLMs rely more on lexical cues than on the details of chemical structure. The best-performing model hit 64.4% accuracy, which shows both promise and plenty of room for growth in how AI handles olfactory data.
Are we expecting too much from these models? Maybe. But in a world where AI is expected to understand every nuance, why shouldn't it be able to differentiate between the scent of a rose and a rotten egg?
Going Global
There's more. The benchmark also tested a subset of questions across 21 languages. Aggregating predictions across multiple languages improved olfactory prediction, with the top language ensemble reaching an AUROC (area under the receiver operating characteristic curve) of 0.86.
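The article doesn't say how the per-language predictions were combined. One simple option, assumed here, is to average each molecule's predicted probability across languages and then score the averaged predictions with AUROC, i.e. the probability that a randomly chosen positive example is ranked above a randomly chosen negative one. The functions `ensemble_scores` and `auroc` are hypothetical helpers, and the numbers are made-up toy data, not benchmark results.

```python
from statistics import mean


def ensemble_scores(preds_by_language: dict[str, list[float]]) -> list[float]:
    """Average each item's predicted probability across languages (assumed aggregation rule)."""
    lists = list(preds_by_language.values())
    if len({len(p) for p in lists}) != 1:
        raise ValueError("every language must score the same items in the same order")
    # zip(*lists) groups the scores for item 0, item 1, ... across languages.
    return [mean(item_scores) for item_scores in zip(*lists)]


def auroc(scores: list[float], labels: list[int]) -> float:
    """AUROC via pairwise comparison: positives outranking negatives, ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy, made-up predictions for four molecules: the first two are positives.
labels = [1, 1, 0, 0]
preds = {
    "en": [0.9, 0.3, 0.4, 0.2],
    "de": [0.5, 0.8, 0.6, 0.1],
}

print(auroc(preds["en"], labels))              # 0.75
print(auroc(preds["de"], labels))              # 0.75
print(auroc(ensemble_scores(preds), labels))   # 1.0
```

In this toy case each language misranks a different positive, and averaging cancels those individual mistakes. That is the intuition behind why a multilingual ensemble can outperform any single language, though real gains will be smaller and noisier.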
This changes the landscape: AI evaluation isn't just about analyzing text or images anymore, and teaching models to 'smell' is now a measurable goal. And just like that, the leaderboard shifts again.