Taming AI Safety: The Complexity of Benchmark Testing

In AI safety, not all benchmarks are created equal. The discrepancies between benchmark scores and real-world behavior are glaring. Safety evaluation isn't just about running a model through a series of tests; the outcome also depends on the test format and on the architecture the model is deployed in.

Benchmark Limitations

Consider this: a safety score earned on a benchmark doesn't guarantee how a model will behave once it's wrapped in an agentic scaffold that the benchmark never tested. Recent tests ran six frontier models through four distinct deployment configurations: direct API, ReAct, multi-agent critic, and map-reduce delegation. The scale? A staggering 62,808 blinded, pre-registered, equivalence-tested evaluations across four safety benchmarks.

The findings highlight an essential point. While the ReAct and multi-agent scaffolds stayed within a tight +/-2 percentage point equivalence margin, map-reduce delegation suffered a noticeable safety degradation. But here's the kicker: much of this loss is attributed to format conversion rather than to disrupted reasoning. On identical items, the shift from multiple-choice to open-ended phrasing altered the measured safety rate by 5-20 percentage points. A benchmark needs a format that preserves the integrity of what it measures.
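To make the idea of an equivalence margin concrete, here is a minimal sketch of a two one-sided tests (TOST) check on the difference between two safety rates. The article does not say which equivalence procedure was used, so the normal approximation, the function name `tost_two_proportions`, and the default `alpha` are assumptions for illustration only; just the 2 percentage point margin comes from the text.

```python
from math import sqrt
from statistics import NormalDist


def tost_two_proportions(safe_a, n_a, safe_b, n_b, margin=0.02, alpha=0.05):
    """TOST equivalence check for two safety rates (hypothetical helper).

    safe_a / n_a: safe responses and total items for the baseline (e.g. direct API).
    safe_b / n_b: the same counts for the scaffolded configuration.
    margin: equivalence margin as a proportion (0.02 = 2 percentage points).
    Uses a normal approximation, which is an assumption of this sketch.
    """
    p_a = safe_a / n_a
    p_b = safe_b / n_b
    diff = p_b - p_a
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    if se == 0:
        raise ValueError("standard error is zero; rates are all 0 or all 1")

    std_normal = NormalDist()
    # Test 1: reject H0 "diff <= -margin" when diff sits well above -margin.
    p_lower = 1 - std_normal.cdf((diff + margin) / se)
    # Test 2: reject H0 "diff >= +margin" when diff sits well below +margin.
    p_upper = std_normal.cdf((diff - margin) / se)

    # Equivalence is declared only if BOTH one-sided tests reject.
    p_value = max(p_lower, p_upper)
    return diff, p_value, p_value < alpha
```

Calling `tost_two_proportions(safe_a, n_a, safe_b, n_b)` returns the observed difference, the TOST p-value, and a boolean. The point of the design is that a scaffold is called "equivalent" only when the data rule out a shift larger than the margin in either direction, which is a stricter claim than simply failing to find a difference.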

Heterogeneity and Format Sensitivity

The heterogeneity between models and scaffolds is stark. Under the map-reduce configuration, Opus lost 16.8 percentage points while Llama 4 gained 18.8. Because the models moved in opposite directions, the scaffold effect is model-specific rather than uniform, which is consistent with structural scaffold architecture explaining a mere 0.4% of outcome variance. In contrast, the choice of benchmark accounted for 45 times more variance. Is the industry placing too much emphasis on architecture over benchmark selection?
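One common way to express "share of outcome variance explained" by a factor is eta-squared: the between-group sum of squares divided by the total sum of squares. The sketch below assumes that definition; the article does not state how its variance figures were computed, and the rows in `toy_rows` are made-up numbers for illustration, not results from the study.

```python
from collections import defaultdict
from statistics import fmean


def eta_squared(rows, factor, outcome="rate"):
    """Share of total variance in `outcome` explained by grouping on `factor`."""
    grand_mean = fmean(r[outcome] for r in rows)
    ss_total = sum((r[outcome] - grand_mean) ** 2 for r in rows)
    if ss_total == 0:
        raise ValueError("outcome has no variance")

    groups = defaultdict(list)
    for r in rows:
        groups[r[factor]].append(r[outcome])

    # Between-group sum of squares: how far each group mean sits from the
    # grand mean, weighted by the number of observations in the group.
    ss_between = sum(
        len(values) * (fmean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total


# Illustrative toy data only (hypothetical safety rates).
toy_rows = [
    {"scaffold": "direct_api", "benchmark": "A", "rate": 0.90},
    {"scaffold": "direct_api", "benchmark": "B", "rate": 0.60},
    {"scaffold": "map_reduce", "benchmark": "A", "rate": 0.86},
    {"scaffold": "map_reduce", "benchmark": "B", "rate": 0.62},
]

share_scaffold = eta_squared(toy_rows, "scaffold")
share_benchmark = eta_squared(toy_rows, "benchmark")
```

In this toy table the benchmark grouping absorbs almost all of the spread while the scaffold grouping absorbs very little, which is the shape of result the article describes. A full analysis would also need to account for interactions between model, scaffold, and benchmark.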

The Bigger Picture

What's the takeaway? The generalizability coefficient, estimated at G = 0.000 with a bootstrap 95% confidence interval from 0.000 to 0.752, suggests that relying on a single composite safety number as a deployment criterion is futile. The benchmarks tested are the so-called "easy cases," but more consequential properties like scheming and CBRN uplift are just as format- and scaffold-sensitive. The implication is clear: the importance of testing across configurations and formats can't be overstated.
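For readers unfamiliar with generalizability theory, the sketch below shows how a G coefficient of exactly zero can arise. It assumes the simplest one-facet crossed design, with models as the objects of measurement and evaluation conditions (benchmark and scaffold combinations) as the facet, and a percentile bootstrap over conditions. The study's actual design is not described in this article, so every name and design choice here is an assumption.

```python
import random
from statistics import fmean


def g_coefficient(scores):
    """Relative G coefficient for a models x conditions score matrix.

    scores[p][i] is the safety rate of model p under condition i.
    Needs at least 2 models and 2 conditions.
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = fmean(x for row in scores for x in row)
    row_means = [fmean(row) for row in scores]
    col_means = [fmean(row[i] for row in scores) for i in range(n_i)]

    ss_p = n_i * sum((m - grand) ** 2 for m in row_means)
    ss_res = sum(
        (scores[p][i] - row_means[p] - col_means[i] + grand) ** 2
        for p in range(n_p)
        for i in range(n_i)
    )
    ms_p = ss_p / (n_p - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Negative variance estimates are truncated at zero, which is how an
    # estimate of exactly G = 0.000 can occur.
    var_model = max(0.0, (ms_p - ms_res) / n_i)
    rel_error = ms_res / n_i
    if var_model + rel_error == 0:
        return 0.0
    return var_model / (var_model + rel_error)


def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for G, resampling conditions with replacement."""
    rng = random.Random(seed)
    n_i = len(scores[0])
    estimates = []
    for _ in range(n_boot):
        cols = [rng.randrange(n_i) for _ in range(n_i)]
        estimates.append(g_coefficient([[row[c] for c in cols] for row in scores]))
    estimates.sort()
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

When models keep the same ordering across every condition, G approaches 1; when the ordering flips from one condition to the next, the model variance estimate goes to zero and G with it. A G near zero therefore means that a model's composite score says little about how it would rank under a different benchmark or scaffold.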

This isn't just a theoretical exercise. The real question is: how much can we trust AI safety benchmarks today? As the industry pushes forward, understanding how format and configuration affect measured safety is essential. The release of ScaffoldSafety's code, data, and prompts is a step in the right direction, but the journey toward true AI safety is only beginning.