MERRIN: Testing Search Agents in the Wild Web
MERRIN challenges AI agents with real-world web searches, revealing their struggles with multimodal evidence and noisy environments. Strong models still fall short against humans.
In the world of AI, MERRIN emerges as a rigorous benchmark for testing search-augmented agents in the chaos of the open web. Short for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments, MERRIN sets a high bar, requiring agents to sift through conflicting evidence across diverse formats, including text, video, and audio, without explicit guidance.
A New Kind of Challenge
Unlike traditional search benchmarks, MERRIN forces AI to think beyond simple keyword matching. With natural language queries devoid of modality hints, the task is to retrieve and reason over evidence that might not be textual at all. It's like asking an AI to find a needle in a haystack where the needle could be an audio snippet, a video clip, or a line of text.
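To make the setup concrete, here is a minimal sketch of what such a task might look like in Python. The class names, fields, and example item are assumptions for illustration only; they are not the actual MERRIN data format.

```python
from dataclasses import dataclass, field
from typing import Literal

Modality = Literal["text", "image", "video", "audio"]

@dataclass
class EvidenceItem:
    """One piece of web evidence; may support, contradict, or be irrelevant to the query."""
    url: str
    modality: Modality
    content: str        # transcript, caption, or extracted text
    is_relevant: bool   # hidden from the agent; used only for post-hoc analysis

@dataclass
class SearchTask:
    """A hypothetical benchmark item: a plain-language query with no modality hint, plus a noisy evidence pool."""
    query: str
    evidence_pool: list[EvidenceItem] = field(default_factory=list)
    gold_answer: str = ""

# The query never says the answer lives in a video; the agent has to discover that.
task = SearchTask(
    query="Which speaker first mentions the recall deadline?",
    evidence_pool=[
        EvidenceItem("https://example.com/blog-post", "text", "Unrelated commentary on product recalls.", False),
        EvidenceItem("https://example.com/panel-clip", "video", "Transcript: '...the deadline is March 3rd...'", True),
    ],
    gold_answer="The panel moderator",
)
```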
The models are put through their paces: ten in total, including proprietary systems like GPT-5.4-mini and the open-weight Qwen3 family. They operate in three search settings: no search, native search, and agentic search. Yet even the best performers struggle. The top agent clocks in at a mere 40.1% accuracy, while the average languishes at 22.3%. Clearly, the agents are outmatched by the complexity of the task.
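A quick sketch of how such per-model, per-setting accuracy could be tallied is shown below. The record format and the toy numbers are assumptions; the published MERRIN results are not reproduced here.

```python
from collections import defaultdict

# Hypothetical run records: (model, search_setting, answered_correctly).
runs = [
    ("agent-a", "no_search", False),
    ("agent-a", "agentic_search", True),
    ("agent-b", "native_search", False),
    ("agent-b", "agentic_search", True),
]

def accuracy_by_model_and_setting(runs):
    """Aggregate exact-match accuracy for each (model, search setting) pair."""
    totals, hits = defaultdict(int), defaultdict(int)
    for model, setting, correct in runs:
        totals[(model, setting)] += 1
        hits[(model, setting)] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}

print(accuracy_by_model_and_setting(runs))
```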
Humans Still Have the Edge
So why are these advanced agents trailing human performance? The answer seems to lie in their inefficiency. While agents like Gemini Deep Research are designed for exploration, they often overdo it. They take more steps and invoke more tools than necessary, and they are still easily sidetracked by irrelevant or conflicting information.
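One way to quantify that inefficiency is to look at accuracy alongside the cost of each run. The trace fields and the derived metrics below are illustrative assumptions, not metrics defined by the benchmark itself.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    """Summary of one agent run: how long it took and whether it got the answer right."""
    steps: int
    tool_calls: int
    correct: bool

def efficiency(traces: list[Trace]) -> dict[str, float]:
    """Accuracy paired with cost: runs that burn steps on distractors drag these numbers down."""
    n = len(traces)
    return {
        "accuracy": sum(t.correct for t in traces) / n,
        "avg_steps": sum(t.steps for t in traces) / n,
        "correct_per_100_tool_calls": 100 * sum(t.correct for t in traces)
                                      / max(1, sum(t.tool_calls for t in traces)),
    }
```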
In contrast, humans, with their knack for contextual understanding, tend to zero in on relevant sources far more efficiently. Our digital counterparts, it seems, are still figuring out the basics of managing their own search budget.
What MERRIN Means for AI Development
MERRIN's findings aren't just academic. They highlight an important gap in AI development: the ability to perform robust multimodal search in noisy environments. For anyone betting on AI to revolutionize search, these results are a wake-up call. Show me the inference costs, and then we'll talk about real-world deployment.
Ambitious search agents sound great until you benchmark them and find they can't match human efficiency. MERRIN could become the yardstick by which future AI search capabilities are measured, pushing developers to innovate harder and smarter.
As AI continues to inch closer to human-level competence, MERRIN serves as a reminder of the road still ahead. Until the numbers tell a different story, humans remain the superior searchers in the digital wilderness.