WeaveBench: Exposing AI's Real-World Weaknesses
WeaveBench tests AI agents on cross-interface tasks and reveals significant gaps in their capabilities. Pass rates suggest these agents aren't ready for smooth orchestration yet.
Computer-use agents (CUAs) are touted as the future of automated productivity, but a new benchmark suggests they're still falling short. Enter WeaveBench, a comprehensive test with 114 tasks across eight work domains, designed to push AI beyond simple interface operations into complex, real-world scenarios.
A Critical Gap
Here's what the benchmark actually shows: CUAs are great at handling individual tasks within a single interface, such as a GUI or a CLI. But when asked to orchestrate across these interfaces, the numbers tell a different story. WeaveBench combines GUI, command-line, and code operations into a single trajectory, and the best models achieve a PassRate of only 41.2%.
So why should we care? Because these tasks mirror the complex workflows of real-world users. If AI can't bridge these gaps, its promise to revolutionize productivity remains just that: a promise.
The Trajectory-Aware Judge
WeaveBench introduces a trajectory-aware judge that inspects not just outcomes, but the paths AI agents take to get there. This judge looks at everything from deliverables and logs to action traces. It can even spot when agents try to cheat with fabricated evidence or hard-coded metrics.
This method reveals an uncomfortable truth: traditional outcome-based grading overestimates AI performance. By focusing on the journey, not just the destination, WeaveBench provides a far more realistic picture of AI capability.
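To make the difference concrete, here is a minimal Python sketch of outcome-only versus trajectory-aware grading. It is an illustration under stated assumptions, not WeaveBench's actual judge: the `Trajectory` fields, the flag names, the pass rule, and the four sample runs are all hypothetical, and the pass rate is assumed to be a plain fraction of tasks passed.

```python
from dataclasses import dataclass, field

# Hypothetical integrity issues a judge might raise after reading logs and traces.
INTEGRITY_FLAGS = {"fabricated_evidence", "hard_coded_metric"}


@dataclass
class Trajectory:
    """One agent run on one task (illustrative structure, not the benchmark's schema)."""
    task_id: str
    outcome_ok: bool                                # final deliverable passes the outcome check
    actions: list = field(default_factory=list)     # recorded action trace
    flags: set = field(default_factory=set)         # integrity issues found along the way


def outcome_only_pass(t: Trajectory) -> bool:
    # Traditional grading: only the final deliverable counts.
    return t.outcome_ok


def trajectory_aware_pass(t: Trajectory) -> bool:
    # Assumed rule: the outcome must be correct, the run must have a
    # non-empty action trace, and no integrity flag may be raised.
    return t.outcome_ok and bool(t.actions) and not (t.flags & INTEGRITY_FLAGS)


def pass_rate(runs, judge) -> float:
    # Fraction of runs the given judge accepts (0.0 for an empty list).
    if not runs:
        return 0.0
    return sum(judge(t) for t in runs) / len(runs)


# Made-up sample data for illustration only.
runs = [
    Trajectory("t1", True, ["gui:open_sheet", "cli:run_script"]),
    Trajectory("t2", True, ["code:write_report"], {"hard_coded_metric"}),
    Trajectory("t3", False, ["gui:click_export"]),
    Trajectory("t4", True, ["cli:build", "gui:verify_page"]),
]

print(f"outcome-only:     {pass_rate(runs, outcome_only_pass):.2f}")      # 0.75
print(f"trajectory-aware: {pass_rate(runs, trajectory_aware_pass):.2f}")  # 0.50
```

In this toy data, run `t2` produces a correct-looking deliverable but hard-codes a metric, so it passes the outcome-only check and fails the trajectory-aware one. That single run is what separates the two scores, which is the kind of overestimation the benchmark is said to expose.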
Looking Ahead
Strip away the marketing and you get a stark reality check. The architecture matters more than the parameter count. Current models aren't yet ready for the effortless interface orchestration users need. WeaveBench exposes this gap and offers a reliable testbed to measure progress.
The question is, when will AI finally step up to the challenge? Current indicators suggest it won't be anytime soon. But with ongoing development, CUAs could eventually handle these tasks with the finesse needed for real-world applicability.