AI Model Scaffolds: The Hidden Factor in Capability Scores

AI model evaluations have long been plagued by a hidden variable: the scaffolding that supports them. Recent findings turn common assumptions on their head, showing that scaffold configurations can sway a model's capability scores by up to 28 percentage points. These numbers aren't just statistical noise; they point to a much larger issue.

Scaffold Sensitivity Unveiled

In a controlled comparison, researchers tested three distinct scaffolds (ReAct, a Planner-Actor-Rater multi-agent design, and a planner-then-executor setup) across models ranging from Anthropic's Claude Opus 4.7 to OpenAI's GPT-5.5. With all other conditions held constant, accuracy for Opus swung by a staggering 28 points within a single difficulty level, depending on the scaffold alone. Models didn't just differ in capability; they also differed in how much scaffolding influenced their performance.
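To make the "swing" concrete: it is the gap between the best and worst scaffold for one model at one difficulty level. The sketch below is a minimal illustration, not the researchers' code. The function name `scaffold_swing`, the model labels, and every accuracy figure are hypothetical placeholders made up for the example.

```python
from collections import defaultdict


def scaffold_swing(results):
    """Compute the best-minus-worst accuracy gap per (model, level).

    `results` is an iterable of (model, level, scaffold, accuracy_pct)
    tuples. Returns {(model, level): (swing, best_scaffold, worst_scaffold)}.
    """
    by_cell = defaultdict(dict)
    for model, level, scaffold, accuracy in results:
        by_cell[(model, level)][scaffold] = accuracy

    swings = {}
    for cell, accuracies in by_cell.items():
        best = max(accuracies, key=accuracies.get)
        worst = min(accuracies, key=accuracies.get)
        swings[cell] = (accuracies[best] - accuracies[worst], best, worst)
    return swings


# Placeholder numbers invented for illustration; they are NOT from the study.
toy_results = [
    ("model-a", 1, "react", 60.0),
    ("model-a", 1, "planner-actor-rater", 70.0),
    ("model-a", 1, "planner-executor", 65.0),
    ("model-b", 1, "react", 50.0),
    ("model-b", 1, "planner-actor-rater", 48.0),
    ("model-b", 1, "planner-executor", 52.0),
]

swings = scaffold_swing(toy_results)
for (model, level), (swing, best, worst) in sorted(swings.items()):
    print(f"{model} L{level}: {swing:.1f}-point swing ({worst} -> {best})")
```

Reporting this gap next to any single accuracy figure shows at a glance how much of a score depends on the scaffold rather than the model.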

The pre-registered hypothesis that more advanced models would resist scaffold influence didn't hold water. In fact, Anthropic's top model gained the most from structured scaffolds at higher difficulty levels. Does this reveal a fundamental flaw in the way we categorize AI capability?

Challenging the Single-Scaffold Narrative

While single-scaffold capability scores might seem straightforward, they're anything but. The multi-agent setup outperformed ReAct on Level 2 within Anthropic's lineup, yet this advantage didn't transfer to cross-provider models. This makes model family, not capability tier, the true differentiating factor. Forget tier-scaling as a reliable metric under these conditions.

If you think the planner-executor scaffold has a leg up on file-reading tasks, think again: the results did not bear out this presumed edge. Meanwhile, structured scaffolds not only reduced tool calls but also recovered from errors more effectively at tougher levels. The star performer? Gemini, with its planner-then-executor scaffold, proved both cost-effective and highly accurate at the advanced stage.

The Real Cost of Ignoring Scaffolds

Why should industry insiders care? Because single-scaffold performance numbers are misleading, and the gaps between scaffolds are not guaranteed to narrow as models grow more sophisticated. If the AI can hold a wallet, who writes the risk model? AI developers and stakeholders can't ignore the scaffold's impact without risking skewed assessments and misguided investments.

It's high time we reassessed how we interpret AI capabilities. Slapping a model on a GPU rental isn't a convergence thesis. Scaffolds are more than mere frameworks; they're integral to understanding and deploying AI effectively. The intersection is real. Ninety percent of the projects aren't.