DiffSpot Exposes Vision-Language Models' Flaws in Fine-Grained Perception

By Rina Shimizu, May 29, 2026

Vision-language models struggle with subtle visual differences, as the new DiffSpot benchmark shows. That limitation could affect GUI agents and design tools.

Vision-language models (VLMs) have become good at matching images with text, yet they appear to struggle with fine-grained visual differences. That could pose a problem for applications such as GUI agents and design tools, which depend on noticing small changes.

Introducing DiffSpot

DiffSpot is a benchmark designed to expose these shortcomings. It builds controlled image pairs by altering exactly one CSS property of a web interface and capturing the result. Because any change is confined to the targeted element, each pair is a focused test of perception.
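
To make the single-property idea concrete, here is a minimal sketch of how one might produce such a variant. It is not DiffSpot's actual pipeline: the nested-dictionary stylesheet, the selectors, and the helper names (`mutate_one_property`, `to_css`, `changed_declarations`) are all hypothetical, and rendering the two stylesheets to screenshots in a browser is left out.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: selector -> {property: value}
Styles = Dict[str, Dict[str, str]]


def mutate_one_property(styles: Styles, selector: str, prop: str, new_value: str) -> Styles:
    """Return a copy of `styles` in which exactly one declaration differs."""
    if selector not in styles or prop not in styles[selector]:
        raise KeyError(f"{selector} has no declaration for {prop}")
    if styles[selector][prop] == new_value:
        raise ValueError("new value must differ from the original")
    # Copy every rule so the original stylesheet stays untouched.
    mutated = {sel: dict(decls) for sel, decls in styles.items()}
    mutated[selector][prop] = new_value
    return mutated


def to_css(styles: Styles) -> str:
    """Serialize the nested dict to CSS text, one rule per line."""
    return "\n".join(
        f"{sel} {{ " + " ".join(f"{p}: {v};" for p, v in decls.items()) + " }"
        for sel, decls in styles.items()
    )


def changed_declarations(a: Styles, b: Styles) -> List[Tuple[str, str]]:
    """List the (selector, property) pairs whose values differ between a and b."""
    return [(sel, p) for sel in a for p in a[sel] if a[sel][p] != b[sel][p]]


# Illustrative stylesheet; the selectors and values are made up.
base = {
    ".button": {"padding": "8px", "border-radius": "4px"},
    ".title": {"font-size": "16px"},
}
variant = mutate_one_property(base, ".button", "border-radius", "6px")

print(to_css(variant))
print(changed_declarations(base, variant))
```

Checking that `changed_declarations` returns a single entry is a cheap way to confirm that a pair really differs in one property only.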

DiffSpot comprises 4,400 image pairs. Of these, 3,900 contain an alteration, spread across 13 CSS properties and three difficulty tiers. The remaining 500 pairs contain no difference at all and serve as a control against hallucinated changes.
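
The counts can be checked with a few lines of arithmetic. The totals below come from the article; the even split across properties and tiers is only an assumption made for illustration, since the article does not say how the 3,900 altered pairs are distributed.

```python
TOTAL_PAIRS = 4400
CHANGED_PAIRS = 3900
CONTROL_PAIRS = 500
NUM_PROPERTIES = 13
NUM_TIERS = 3

# Altered pairs plus no-change controls account for the whole benchmark.
assert CHANGED_PAIRS + CONTROL_PAIRS == TOTAL_PAIRS

# Assumption: an even split, which the article does not confirm.
pairs_per_property = CHANGED_PAIRS // NUM_PROPERTIES      # 300
pairs_per_property_tier = pairs_per_property // NUM_TIERS  # 100

# Share of the benchmark devoted to no-change controls.
control_share = CONTROL_PAIRS / TOTAL_PAIRS

print(pairs_per_property, pairs_per_property_tier, round(control_share, 3))
```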

Performance Analysis

Across 13 leading VLMs evaluated in zero-shot conditions, the best model identified only 40.7% of actual changes. Hard-tier recall is particularly troubling, falling below 23% for every model tested.
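
For readers unfamiliar with the metric, recall here is the share of real changes a model correctly identifies, and the no-change pairs allow a separate false-alarm rate to be measured. The sketch below shows one way to compute both. The `Result` record, the "easy" tier name, and the toy data are assumptions for illustration and do not reflect DiffSpot's actual scoring code or results.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Result(NamedTuple):
    tier: Optional[str]  # difficulty tier, or None for a no-change control pair
    flagged: bool        # the model reported a difference
    correct: bool        # the reported difference matches the real change


def recall_by_tier(results: List[Result]) -> Dict[str, float]:
    """Fraction of altered pairs in each tier where the change was identified."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for r in results:
        if r.tier is None:
            continue  # control pairs have no change to recall
        totals[r.tier] += 1
        if r.flagged and r.correct:
            hits[r.tier] += 1
    return {tier: hits[tier] / totals[tier] for tier in totals}


def false_alarm_rate(results: List[Result]) -> float:
    """Fraction of no-change control pairs where the model reported a change."""
    controls = [r for r in results if r.tier is None]
    if not controls:
        return 0.0
    return sum(r.flagged for r in controls) / len(controls)


# Toy results, invented purely to exercise the functions.
toy = [
    Result("easy", True, True),
    Result("easy", True, False),
    Result("hard", False, False),
    Result("hard", True, True),
    Result("hard", False, False),
    Result("hard", False, False),
    Result(None, True, False),
    Result(None, False, False),
]

tier_recall = recall_by_tier(toy)
alarm_rate = false_alarm_rate(toy)
print(tier_recall, alarm_rate)
```

Keeping the two numbers separate matters: a model that reports a change on every pair would score well on detection alone but poorly on the controls.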

Why does this matter? If VLMs can't reliably detect small changes, their application in design tools could be limited. GUI agents need to perceive subtlety to function effectively. The paper, published in Japanese, reveals that current models fall short.

Why the Struggle?

DiffSpot's results also show that difficulty is not simply a matter of how many pixels change. It depends on which CSS property is altered, and neither the pixel-level size of a change nor the CLIP distance between the two images accurately predicts recall. So, what's the future for these models in fine-grained tasks?
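
One common way to test whether a quantity such as pixel change "predicts" recall is a rank correlation. The sketch below implements Spearman's coefficient in plain Python as an illustration of that kind of check. Whether the paper uses this statistic is not stated in the article, and the per-property numbers are invented.

```python
from math import sqrt
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented per-property values: changed-pixel counts and model recall.
pixels_changed = [120, 15, 900, 60]
recall = [0.30, 0.35, 0.10, 0.05]

rho = spearman(pixels_changed, recall)
print(round(rho, 2))  # -0.4 for this toy data
```

A coefficient near +1 or -1 would mean the measure orders the properties by difficulty almost perfectly; values near zero would be consistent with the article's point that pixel size alone is a poor predictor.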

This is a wake-up call for researchers and developers relying on these models. VLMs need significant improvement before they're ready for prime-time use in tools requiring precise visual recognition.
