WeaveBench: Exposing AI's Real-World Weaknesses

Computer-use agents (CUAs) are touted as the future of automated productivity, but a new benchmark suggests they're still falling short. Enter WeaveBench, a comprehensive test with 114 tasks across eight work domains, designed to push AI beyond simple interface operations into complex, real-world scenarios.

A Critical Gap

Here's what the benchmarks actually show: CUAs are great at handling individual tasks within separate interfaces like GUI or CLI. But when asked to orchestrate across these interfaces, the numbers tell a different story. WeaveBench combines GUI, command-line, and code operations into a single trajectory, and the best models only achieve a PassRate of 41.2%.

So why should we care? Because the reality is, these tasks mirror the complex workflows of real-world users. If AI can't bridge these gaps, its promise to revolutionize productivity remains just that, a promise.

The Trajectory-Aware Judge

WeaveBench introduces a trajectory-aware judge that inspects not just outcomes, but the paths AI agents take to get there. This judge looks at everything from deliverables and logs to action traces. It even can spot when agents try to cheat with fabricated evidence or hard-coded metrics.

This method reveals an uncomfortable truth: traditional outcome-based grading overestimates AI performance. By focusing on the journey, not just the destination, WeaveBench provides a far more realistic picture of AI capability.

Looking Ahead

Strip away the marketing and you get a stark reality check. The architecture matters more than the parameter count. Current models aren't yet ready for the effortless interface orchestration users need. WeaveBench exposes this gap and offers a reliable testbed to measure progress.

The question is, when will AI finally step up to the challenge? Current indicators suggest it won't be anytime soon. But with ongoing development, CUAs could eventually handle these tasks with the finesse needed for real-world applicability.

WeaveBench: Exposing AI's Real-World Weaknesses

A Critical Gap

The Trajectory-Aware Judge

Looking Ahead

Key Terms Explained