Can AI Models Really Spot Nonsense? BullshitBench Puts Them to the Test
New benchmark BullshitBench reveals AI models struggle to detect nonsense. While Anthropic edges ahead, Google's Gemini 3.0 falls short.
Artificial intelligence is often lauded for its ability to solve complex problems, but a new benchmark called BullshitBench challenges that reputation by testing whether models can spot sheer nonsense. Peter Gostev, AI capability lead at Arena, developed the quirky test to see whether AI models can tell realistic prompts from absurdities.
What the English-language press missed
The test cleverly presents AI systems with prompts that sound technical yet collapse under scrutiny. A standout example is a question about the 'viscosity in centipoise of our deal pipeline.' The correct response is to flag the nonsense and decline to engage, but many AI models fail, attempting a serious answer instead.
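To make the setup concrete, here is a minimal sketch of how such an item might be graded. The actual BullshitBench harness, prompts and rubric are not described in detail here, so the query_model stub and the keyword-based grader below are purely illustrative assumptions, not the benchmark's real implementation.

```python
# Illustrative sketch only: BullshitBench's real prompts, grading rubric, and
# model harness are not public in this article; everything below is assumed.

NONSENSE_PROMPTS = [
    # Example item cited in the article; the expected behavior is to flag the
    # premise as meaningless rather than play along.
    "What is the viscosity in centipoise of our deal pipeline?",
]

# Phrases a simple grader might look for as evidence the model pushed back on
# the premise (a real rubric would be far richer than keyword matching).
PUSHBACK_MARKERS = (
    "doesn't make sense",
    "not a meaningful",
    "can't be measured in centipoise",
    "mixing a metaphor",
)


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model under test."""
    return "A healthy deal pipeline typically flows at around 12 centipoise."


def detected_nonsense(response: str) -> bool:
    """Crude grader: did the response challenge the premise at all?"""
    lowered = response.lower()
    return any(marker in lowered for marker in PUSHBACK_MARKERS)


def run_benchmark() -> float:
    """Score the model as the fraction of nonsense prompts it rejected."""
    hits = sum(detected_nonsense(query_model(p)) for p in NONSENSE_PROMPTS)
    return hits / len(NONSENSE_PROMPTS)


if __name__ == "__main__":
    print(f"Nonsense-detection rate: {run_benchmark():.0%}")
```

With the placeholder response above, the grader scores 0%, which mirrors the failure mode the benchmark is designed to expose: a confident, serious-sounding answer to a question that has no answer.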
Google's Gemini 3.0: A disappointing performance
Google's Gemini 3.0, once celebrated as a top-tier model, performed poorly on BullshitBench. It identified the nonsense in fewer than half of the cases, exposing a significant gap between its claimed capabilities and its actual performance on this unconventional test.
Why does this matter? In the real world, recognizing a flawed premise is a key skill, and one that current AI systems clearly need to hone. If models can't spot obvious nonsense, how can they be trusted with more nuanced tasks?
Anthropic's models shine where others falter
Interestingly, Anthropic's models outperformed the rest, rejecting nonsense prompts far more often. Gostev attributes this success to Anthropic's focus on building strong foundational models rather than overemphasizing reasoning capabilities, which can backfire when a model tries to make sense of the absurd.
This raises the question: are AI labs focusing too much on 'intelligence' at the expense of basic judgment? The data shows a clear need to revisit priorities.
Capability vs judgment
The BullshitBench results point to a deeper issue: the gap between AI's raw capability and its basic judgment. While AI models are adept at handling complex tasks, they often skip the fundamental sanity checks that humans apply instinctively. The results are a wake-up call for AI developers to balance high-level intelligence with essential judgment skills.
Western coverage has largely overlooked this aspect of AI development. As these models become more integrated into decision-making processes, ensuring they can distinguish nonsense from valid input becomes ever more critical.
Key Terms Explained
Anthropic: An AI safety company founded in 2021 by former OpenAI researchers, including Dario and Daniela Amodei.
Artificial intelligence (AI): The science of creating machines that can perform tasks requiring human-like intelligence — reasoning, learning, perception, language understanding, and decision-making.
Benchmark: A standardized test used to measure and compare AI model performance.
Gemini: Google's flagship multimodal AI model family, developed by Google DeepMind.