Can AI Models Really Spot Nonsense? BullshitBench Puts Them to the Test
New benchmark BullshitBench reveals AI models struggle to detect nonsense. While Anthropic edges ahead, Google's Gemini 3.0 falls short.
Artificial intelligence is often lauded for its ability to solve complex problems, but a new benchmark called BullshitBench challenges that reputation by testing whether models can spot sheer nonsense. Peter Gostev, AI capability lead at Arena, developed the quirky test to see whether AI models can tell realistic prompts from absurdities.
What the English-language press missed
The test cleverly presents AI systems with prompts that sound technical yet collapse under scrutiny. A standout example is a question about the 'viscosity in centipoise of our deal pipeline.' The correct response is to flag the nonsense and decline to engage, but many AI models fail, attempting a serious answer instead.
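To make the setup concrete, here is a minimal sketch of how such an item might be graded. The actual BullshitBench harness, prompts and rubric are not described in detail here, so the query_model stub and the keyword-based grader below are purely illustrative assumptions, not the benchmark's real implementation.

```python
# Illustrative sketch only: BullshitBench's real prompts, grading rubric, and
# model harness are not public in this article; everything below is assumed.

NONSENSE_PROMPTS = [
    # Example item cited in the article; the expected behavior is to flag the
    # premise as meaningless rather than play along.
    "What is the viscosity in centipoise of our deal pipeline?",
]

# Phrases a simple grader might look for as evidence the model pushed back on
# the premise (a real rubric would be far richer than keyword matching).
PUSHBACK_MARKERS = (
    "doesn't make sense",
    "not a meaningful",
    "can't be measured in centipoise",
    "mixing a metaphor",
)


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model under test."""
    return "A healthy deal pipeline typically flows at around 12 centipoise."


def detected_nonsense(response: str) -> bool:
    """Crude grader: did the response challenge the premise at all?"""
    lowered = response.lower()
    return any(marker in lowered for marker in PUSHBACK_MARKERS)


def run_benchmark() -> float:
    """Score the model as the fraction of nonsense prompts it rejected."""
    hits = sum(detected_nonsense(query_model(p)) for p in NONSENSE_PROMPTS)
    return hits / len(NONSENSE_PROMPTS)


if __name__ == "__main__":
    print(f"Nonsense-detection rate: {run_benchmark():.0%}")
```

With the placeholder response above, the grader scores 0%, which mirrors the failure mode the benchmark is designed to expose: a confident, serious-sounding answer to a question that has no answer.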
Google's Gemini 3.0: A disappointing performance
Google's Gemini 3.0, once celebrated as a top-tier model, performed poorly on BullshitBench. It identified the nonsense in fewer than half of the cases, exposing a significant gap between its claimed capabilities and its actual performance on this unconventional test.
Why does this matter? In the real world, recognizing a flawed premise is a key skill, and one that current AI systems clearly need to hone. If models can't spot obvious nonsense, how can they be trusted with more nuanced tasks?
Anthropic's models shine where others falter
Interestingly, Anthropic's models outperformed the rest, rejecting nonsense prompts far more often. Gostev attributes this success to Anthropic's focus on building strong foundational models rather than overemphasizing reasoning capabilities, which can backfire when a model tries to make sense of the absurd.
This raises the question: are AI labs focusing too much on 'intelligence' at the expense of basic judgment? The data shows a clear need to revisit priorities.
Capability vs judgment
The BullshitBench results point to a deeper issue: the gap between AI's raw capability and its basic judgment. While AI models are adept at handling complex tasks, they often skip the fundamental sanity checks that humans apply instinctively. The results are a wake-up call for AI developers to balance high-level intelligence with essential judgment skills.
Western coverage has largely overlooked this aspect of AI development. As these models become more integrated into decision-making processes, ensuring they can distinguish nonsense from valid input becomes ever more critical.
Key Terms Explained
Anthropic: An AI safety company founded in 2021 by former OpenAI researchers, including Dario and Daniela Amodei.
Artificial intelligence (AI): The science of creating machines that can perform tasks requiring human-like intelligence — reasoning, learning, perception, language understanding, and decision-making.
Benchmark: A standardized test used to measure and compare AI model performance.
Gemini: Google's flagship multimodal AI model family, developed by Google DeepMind.