CARTBENCH: A New Benchmark Challenges Vision-Language Models on Chinese Art
Introducing CARTBENCH, a new benchmark designed to test vision-language models on Chinese artworks. The benchmark reveals current model limitations in connoisseur-level reasoning and authenticity discrimination.
CARTBENCH is a museum-grounded benchmark that pushes vision-language models (VLMs) beyond basic recognition. Focused on Chinese artworks, it tests models across four distinct tasks and exposes the limitations that remain in their understanding and interpretation of art.
The Tasks at Hand
CARTBENCH comprises four subtasks: CURATORQA, CATALOGCAPTION, REINTERPRET, and CONNOISSEURPAIRS. Each subtask is designed to test a different aspect of the models' cognitive abilities.
CURATORQA focuses on evidence-grounded recognition and reasoning, pushing models to link hard evidence with visual content. CATALOGCAPTION requires generating structured, expert-style appreciation, while REINTERPRET demands defensible reinterpretation rated by experts. Lastly, CONNOISSEURPAIRS challenges models with diagnostic authenticity discrimination under visually similar confounds, a task that separates mere pattern recognition from true understanding.
Revealing Shortcomings
Analysis across nine representative VLMs reveals a troubling trend. Some models display high accuracy on CURATORQA, but their performance drops sharply on the more nuanced questions that require linking evidence to visual content and inferring a period from style. Long-form appreciation in CATALOGCAPTION remains far from expert level, underscoring the gap between machine-generated content and human expertise.
Perhaps most concerning is the performance on CONNOISSEURPAIRS, where authenticity discrimination hovers near chance level. That result points to the need for deeper understanding rather than surface-level processing.
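To make "near chance" concrete, here is a minimal sketch of how one might check whether an accuracy score on a pairwise task is distinguishable from guessing. It assumes each CONNOISSEURPAIRS item is a two-way choice, so chance is 50%; the article does not state the exact task format. The function name and the counts are hypothetical illustrations, not figures from the benchmark.

```python
from math import comb


def binomial_two_sided_p(correct: int, total: int, chance: float = 0.5) -> float:
    """Exact two-sided binomial p-value for `correct` out of `total`.

    Returns the probability, under pure guessing at rate `chance`, of seeing
    an outcome at least as unlikely as the observed one.
    """
    def pmf(k: int) -> float:
        return comb(total, k) * chance**k * (1 - chance) ** (total - k)

    observed = pmf(correct)
    # Sum every outcome that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    tail = sum(pmf(k) for k in range(total + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical counts for illustration only (assumed 100 two-way pairs).
p_near_chance = binomial_two_sided_p(correct=54, total=100)
p_clear_signal = binomial_two_sided_p(correct=70, total=100)

print(f"54/100 correct: p = {p_near_chance:.3f}")
print(f"70/100 correct: p = {p_clear_signal:.6f}")
```

Under these assumptions, a large p-value means the score could plausibly come from guessing, while a very small one suggests the model is picking up some real signal. How many pairs a benchmark needs before this distinction is reliable depends on its size, which the article does not report.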
Why CARTBENCH Matters
Why should we care about a benchmark for Chinese art? Because it highlights critical flaws in current VLM capabilities. Connoisseur-level reasoning isn't just a luxury; it's a necessity for applications ranging from art curation to historical preservation. As we increasingly rely on AI for these tasks, CARTBENCH serves as a reminder of how much these models still need to improve.
CARTBENCH marks a convergence of art and AI, a collision that reveals both potential and peril. If models can't discern authenticity, what value do they truly add?