AI Model Scaffolds: The Hidden Factor in Capability Scores
New research challenges the assumption that more advanced AI models are less affected by scaffolding. Instead, scaffold choice greatly impacts accuracy, revealing a hidden dimension in AI evaluations.
AI model evaluations have long been plagued by a hidden variable: the scaffolding that supports them. Recent findings turn assumptions on their head, showing that scaffold configurations can sway a model's capability scores by up to 28 percentage points. These numbers aren't just statistical noise. they're indicative of a much larger issue.
Scaffold Sensitivity Unveiled
In a controlled comparison, researchers tested three distinct scaffolds, ReAct, a Planner-Actor-Rater multi-agent design, and a planner-then-executor setup, across models from Anthropic's Claude Opus 4.7 to OpenAI's GPT-5.5. Holding conditions constant, the results revealed a staggering 28-point swing in accuracy within a single level for Opus. Models didn't just differ in capability. they varied in how scaffolding influenced their performance.
The pre-registered hypothesis that advanced models would resist scaffold influence didn't hold water. In fact, Anthropic's top model showed the most gain from structured scaffolds at higher difficulty levels. Is this revealing a fundamental flaw in the way we categorize AI capability?
Challenging the Single-Scaffold Narrative
While single-scaffold capability scores might seem straightforward, they're anything but. The multi-agent setup outperformed ReAct on Level 2 within Anthropic's lineup, yet this advantage didn't transfer to cross-provider models. This makes model family, not capability tier, the true differentiating factor. Forget tier-scaling as a reliable metric under these conditions.
If you think the planner-executor scaffold has a leg up on file-reading tasks, think again. This presumed edge was essentially debunked. Meanwhile, structured scaffolds not only reduced tool calls but also recovered from errors more effectively at tougher levels. The star performer? Gemini, with its planner-then-executor scaffold, proved both cost-effective and highly accurate at the advanced stage.
The Real Cost of Ignoring Scaffolds
Why should industry insiders care? Because single-scaffold performance numbers are misleading. They're not guaranteed to narrow as models grow more sophisticated. If the AI can hold a wallet, who writes the risk model? AI developers and stakeholders can't ignore the scaffold's impact without risking skewed assessments and misguided investments.
It's high time we reassessed how we interpret AI capabilities. Slapping a model on a GPU rental isn't a convergence thesis. Scaffolds are more than mere frameworks. they're integral to understanding and deploying AI effectively. The intersection is real. Ninety percent of the projects aren't.
Get AI news in your inbox
Daily digest of what matters in AI.
Key Terms Explained
An AI safety company founded in 2021 by former OpenAI researchers, including Dario and Daniela Amodei.
Anthropic's family of AI assistants, including Claude Haiku, Sonnet, and Opus.
Google's flagship multimodal AI model family, developed by Google DeepMind.
Generative Pre-trained Transformer.