QUIET: A New Benchmark Challenging AI Creativity
QUIET introduces a novel approach to evaluating creative capacity in large language models through an automated, multi-blank story cloze test.
Existing benchmarks hit a wall when it comes to creative capability in large language models (LLMs). Most measure a model's knack for narrative continuation via multiple-choice formats, so they fail to evaluate creative generation directly. Critics argue that other methods, such as rubric-based scoring, lack objective, automated mechanisms.
Introducing QUIET
QUIET, or Quality Understanding via Interlocked Evaluation Testing, offers a fresh take. It's a diagnostic benchmark designed to push the boundaries of LLM creativity. The setup? A multi-blank cascaded story cloze test with 10-20 blanks. Each blank sits within a complete narrative structure and carries explicit content constraints.
The blanks are interdependent. The content filled into earlier blanks narrows the solution space for later ones. This cascade dependency mimics how we naturally build narratives, with each element influencing the next. What's clever here is the open-ended generation mode for filling blanks, a nod to real-world creativity challenges.
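To make the cascade concrete, here is a minimal Python sketch of filling blanks one at a time, so that each fill becomes part of the context for the next. It is an illustration, not QUIET's actual harness: the function name fill_cascade, the segments/constraints layout, the generate callback and the strict left-to-right order are all assumptions made for this example.

```python
from typing import Callable, List, Tuple


def fill_cascade(
    segments: List[str],
    constraints: List[str],
    generate: Callable[[str, str], str],
) -> Tuple[List[str], str]:
    """Fill blanks left to right, feeding earlier fills into later contexts.

    segments    -- fixed story text around the blanks (one more than blanks)
    constraints -- one content constraint per blank (hypothetical format)
    generate    -- stand-in for a model call: (context, constraint) -> fill
    """
    if len(segments) != len(constraints) + 1:
        raise ValueError("need exactly one more segment than constraints")

    story = segments[0]
    fills: List[str] = []
    for constraint, next_segment in zip(constraints, segments[1:]):
        # The context already contains every earlier fill, so an earlier
        # choice narrows what can sensibly go into this blank.
        fill = generate(story, constraint)
        fills.append(fill)
        story += fill + next_segment
    return fills, story
```

In a real run, generate would wrap a call to the model under test, and the number of constraints would match the 10-20 blanks of a QUIET story.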
Automated Scoring Mechanism
Now, how do we score creativity? The paper's key contribution: an information-theoretic automated scoring protocol. This protocol leans on the 'calibrated surprise' theoretical framework of Zou and Xu (2026a). Each blank receives a composite score: score = satisfy * (1 + lambda * surprise), with lambda set to 1.0.
'Satisfy' measures logical adherence to the content constraints, steering clear of subjective biases. 'Surprise' measures how unexpected a response is, given that the constraints are met. Responses that fail to satisfy the constraints score zero. Those that conform yet lack ingenuity score low. Meeting the constraints with surprising flair earns high marks.
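The three outcomes follow directly from the formula, as a short Python sketch shows. The function name blank_score and the value ranges are assumptions for illustration: the article does not say how satisfy and surprise are computed or scaled, so the sketch treats satisfy as a value in [0, 1] and surprise as non-negative.

```python
def blank_score(satisfy: float, surprise: float, lam: float = 1.0) -> float:
    """Composite score for one blank: satisfy * (1 + lam * surprise).

    Assumed ranges (not stated in the source): satisfy in [0, 1],
    surprise >= 0. lam defaults to 1.0, the value given for lambda.
    """
    if not 0.0 <= satisfy <= 1.0:
        raise ValueError("satisfy is assumed to lie in [0, 1]")
    if surprise < 0.0:
        raise ValueError("surprise is assumed to be non-negative")
    return satisfy * (1.0 + lam * surprise)


# Constraint violated: the score is zero however surprising the text is.
violated = blank_score(satisfy=0.0, surprise=0.9)

# Constraint met, nothing unexpected: the score stays at the satisfy value.
plain = blank_score(satisfy=1.0, surprise=0.0)

# Constraint met with a surprising fill: surprise scales the score up.
inventive = blank_score(satisfy=1.0, surprise=0.5)
```

Because satisfy multiplies the whole expression, surprise can only amplify a response that already meets its constraints; it can never rescue one that does not.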
Why It Matters
Why should we care about QUIET? This approach might revolutionize how we gauge creativity in AI. By championing objective measures, it could bridge the gap between human-like creativity and rigid algorithmic thinking. But here's the big question: Can a machine ever truly surprise like a human?
Yet, some skeptics might argue this is merely a step toward quantifying an inherently subjective trait. Still, it's a step worth taking. If we ever want AI to contribute genuinely creative solutions in fields like art or storytelling, we need benchmarks like QUIET to set the bar high.