Cracking the Code: OpenClaw's Struggle and Success in SWE-Bench Tests

General-purpose agents like OpenClaw are becoming more common in autonomous coding. Their real test, though, is performing well on benchmarks like SWE-bench. On its own, OpenClaw struggles to satisfy the clean Docker workspace, patch, and prediction contract that proper scoring requires. Enter Claw-SWE-Bench, a benchmark designed to make these agents comparable under fair and consistent conditions.
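To make the "prediction contract" concrete, here is a minimal sketch of one prediction record. It assumes the key names used by the public SWE-bench evaluation harness (`instance_id`, `model_name_or_path`, `model_patch`). Whether Claw-SWE-Bench uses exactly this schema is an assumption, and the helper names and sample values are hypothetical.

```python
import json


def make_prediction(instance_id: str, model_name: str, patch: str) -> dict:
    """Build one prediction record.

    Key names follow the public SWE-bench harness; treating them as the
    Claw-SWE-Bench schema is an assumption made for illustration.
    """
    return {
        "instance_id": instance_id,
        "model_name_or_path": model_name,
        "model_patch": patch,  # unified diff produced by the agent
    }


def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines, one prediction per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


# Hypothetical instance id, model label, and patch, for illustration only.
example = make_prediction(
    "demo__repo-1",
    "openclaw-demo",
    "diff --git a/x.py b/x.py\n",
)
print(to_jsonl([example]), end="")
```

An agent that edits files but never emits a record in the expected shape cannot be scored, which is the gap an adapter has to close.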

Breaking Down Claw-SWE-Bench

Claw-SWE-Bench isn't just another test. It's a multilingual benchmark that fixes the prompt, the runtime budget, the workspace contract, and the patch extraction procedure. Its 350 GitHub issue-resolution instances span 8 languages and 43 repositories and are drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini. It's about thoroughness and fairness in evaluation.
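One way to picture those fixed pieces is as a single immutable configuration shared by every run. The sketch below is an assumption about shape only: the class name, field names, prompt wording, budget, and path are all hypothetical and not taken from the benchmark. The `git diff HEAD` command is just one plausible way to extract a patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: no run can quietly change the protocol
class EvalProtocol:
    prompt_template: str   # the fixed prompt, with a slot for the issue text
    runtime_budget_s: int  # wall-clock limit per instance, in seconds
    workspace_dir: str     # where the clean repository checkout lives
    patch_command: tuple = ("git", "diff", "HEAD")  # assumed extraction step


def render_prompt(protocol: EvalProtocol, issue_text: str) -> str:
    """Fill the fixed template with one instance's issue text."""
    return protocol.prompt_template.format(issue=issue_text)


# Placeholder values, for illustration only.
protocol = EvalProtocol(
    prompt_template="Resolve the following issue in this repository:\n{issue}",
    runtime_budget_s=1800,
    workspace_dir="/workspace/repo",
)
print(render_prompt(protocol, "Example issue text"))
```

Holding these values constant is what lets a score difference be attributed to the agent rather than to the setup around it.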

For those wanting quicker validation, Claw-SWE-Bench Lite offers an 80-instance subset. This subset is carefully selected using a cost-aware, rank-aware process across 17 calibration columns. It's the faster option for quick checks without losing the essence of the full test.
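The article does not spell out the selection procedure, so the following is only a toy illustration of what "cost-aware, rank-aware" could mean, not the authors' method. It assumes each calibration column is one configuration's pass/fail results, samples random candidate subsets, and keeps the one whose column ranking best agrees with the full set (Kendall's tau) after a penalty on average instance cost. All names, the `cost_weight` parameter, and the data are hypothetical.

```python
import random


def column_scores(results, rows):
    """Pass rate of each calibration column over the chosen instance rows."""
    n_cols = len(results[0])
    return [sum(results[r][c] for r in rows) / len(rows) for c in range(n_cols)]


def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i in range(len(a)):
        for j in range(i + 1, len(a)):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


def select_subset(results, costs, k, cost_weight=0.1, trials=200, seed=0):
    """Random search for k rows that preserve column ranking at low cost."""
    rng = random.Random(seed)
    n = len(results)
    full_scores = column_scores(results, list(range(n)))
    mean_cost = sum(costs) / n
    best_rows, best_objective = None, float("-inf")
    for _ in range(trials):
        rows = rng.sample(range(n), k)
        agreement = kendall_tau(full_scores, column_scores(results, rows))
        relative_cost = (sum(costs[r] for r in rows) / k) / mean_cost
        objective = agreement - cost_weight * relative_cost
        if objective > best_objective:
            best_rows, best_objective = sorted(rows), objective
    return best_rows


# Toy data: 6 instances x 3 calibration columns, with per-instance costs.
toy_results = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 0, 0], [1, 0, 0], [1, 1, 0]]
toy_costs = [1.0, 2.0, 1.5, 0.5, 3.0, 1.0]
print(select_subset(toy_results, toy_costs, k=3))
```

The point of a rank-aware objective is that a small subset is useful when it orders agents the same way the full benchmark does, even if the absolute scores differ.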

OpenClaw's Performance Overhaul

OpenClaw with a minimal direct-diff adapter initially scores only 19.1% Pass@1 on the full benchmark. That's a rough start. However, when equipped with a full adapter, it leaps to 73.4% while using the same GLM 5.1 backbone. This stark contrast highlights how important adapter design is for these coding agents. Why settle for less when a well-designed adapter can transform performance?
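For readers less familiar with the metric, Pass@1 here is the share of instances resolved on a single attempt. The sketch below computes it from a list of per-instance outcomes and then works out the adapter gain from the two reported percentages. The function names and the toy outcome list are hypothetical; only 19.1 and 73.4 come from the text.

```python
def pass_at_1(outcomes):
    """Percentage of instances resolved on the single attempt."""
    return 100.0 * sum(outcomes) / len(outcomes)


def gain(before_pct, after_pct):
    """Absolute gain in percentage points and the relative multiplier."""
    return round(after_pct - before_pct, 1), round(after_pct / before_pct, 2)


# Toy outcomes: 3 of 4 instances resolved.
print(pass_at_1([True, False, True, True]))

# Reported figures: minimal direct-diff adapter vs. full adapter.
points, factor = gain(19.1, 73.4)
print(points, factor)
```

With the reported figures, the full adapter adds 54.3 percentage points, roughly 3.8 times the minimal adapter's score, with the backbone unchanged.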

Pass@1 shifts substantially across model and harness setups. A nine-model sweep with OpenClaw spans 29.4 percentage points, and swapping among five harnesses with the models held fixed moves Pass@1 by 27.4 percentage points. The choice of harness and model also drives total API cost, so picking well can bring substantial savings. Smart choice, substantial impact.
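A sweep like this is easiest to read when accuracy and cost sit side by side. The sketch below assumes each run is summarized by its resolved count, instance total, and API spend, then reports the Pass@1 spread and the cost per resolved instance. The class, field names, and every number are placeholders, not results from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    name: str
    resolved: int        # instances resolved on the single attempt
    total: int           # instances attempted
    api_cost_usd: float  # total API spend for the run

    @property
    def pass_at_1(self) -> float:
        return 100.0 * self.resolved / self.total

    @property
    def cost_per_resolved(self) -> float:
        if self.resolved == 0:
            return float("inf")
        return self.api_cost_usd / self.resolved


def spread_pp(runs) -> float:
    """Gap between the best and worst Pass@1, in percentage points."""
    scores = [run.pass_at_1 for run in runs]
    return max(scores) - min(scores)


# Placeholder runs, for illustration only.
runs = [
    RunSummary("config-a", resolved=40, total=100, api_cost_usd=50.0),
    RunSummary("config-b", resolved=60, total=100, api_cost_usd=90.0),
]
print(spread_pp(runs))
for run in runs:
    print(run.name, run.pass_at_1, run.cost_per_resolved)
```

In this made-up example the more accurate configuration also costs more per resolved instance, which is the kind of trade-off a Pass@1 number alone would hide.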

Why Developers Should Care

Claw-SWE-Bench isn't just a tool for evaluation. It treats harness and cost accounting as fundamental parts of coding-agent evaluation. With both a full benchmark and a low-cost reference set, it offers a reliable way to assess coding agents. Who wouldn't want a benchmark that puts everything in perspective?

If you're looking to measure and improve agent performance, check out the data on GitHub and Hugging Face. Clone the repo. Run the test. Then form an opinion. It's all about seeing the bigger picture and knowing where your agent stands.
