Meet SLUMP: The Coding Challenge Your AI Needs to Pass
New benchmark SLUMP challenges coding AI with real-world task dynamics, revealing limitations and a potential fix with ProjectGuard.
This week in AI benchmarks: meet SLUMP. It's not your regular coding-agent test, and that's a good thing. Most coding benchmarks hand the AI the full task on a silver platter. In real life, though, coding is more of a messy, evolving conversation. Enter SLUMP, a new benchmark that aims to replicate this progressive task discovery.
What's the Deal with SLUMP?
SLUMP, which stands for 'Specification Loss Under Emergent Specification', throws 20 recent machine learning papers at the AI. We're talking 10 from ICML 2025 and 10 from NeurIPS 2025. That's a hefty challenge. Each paper is broken down into 371 tiny, verifiable components. The catch? The AI doesn't get the full picture at once. It has to piece it together through about 60 coding requests.
To measure success, SLUMP uses a five-level faithfulness rubric. The aim is to see how well the AI can track and implement the evolving task without the final paper acting as a cheat sheet. And here's where it gets juicy: the AI is scored on how faithfully it recovers and implements each component from the interactions alone.
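To make that setup concrete, here's a minimal sketch of how an emergent-specification benchmark could be represented: one record per component, a note of which request reveals it, and a 0-4 faithfulness grade. SLUMP's actual harness, component format, and rubric aren't published in this post, so every name below (Component, PaperTask, summarize) is a hypothetical illustration, not the benchmark's real API.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """One small, verifiable piece of a paper's implementation."""
    name: str
    revealed_at_request: int   # which of the ~60 coding requests introduces it
    faithfulness: int = 0      # 0 (missing) .. 4 (fully faithful): a five-level rubric

@dataclass
class PaperTask:
    paper_id: str
    components: list = field(default_factory=list)

def summarize(task: PaperTask) -> dict:
    """Aggregate rubric scores the way the article reports them."""
    return {
        "fully_faithful": sum(c.faithfulness == 4 for c in task.components),
        "severe_failures": sum(c.faithfulness <= 1 for c in task.components),
    }

# Toy example: three components revealed across different requests.
task = PaperTask("icml-2025-example", [
    Component("data loader", revealed_at_request=3, faithfulness=4),
    Component("loss term", revealed_at_request=17, faithfulness=2),
    Component("eval metric", revealed_at_request=41, faithfulness=1),
])
print(summarize(task))  # {'fully_faithful': 1, 'severe_failures': 1}
```

The key point the sketch captures is that components arrive spread across requests, so the grader can only reward what the agent remembered to carry forward.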
AI Platforms in the Spotlight
So, how did the big players do? Claude Code and Codex were put through their paces. Given the task as a single-shot, full specification, both platforms did well: Claude Code nailed 16 out of 20 papers, while Codex wasn't far behind at 14 out of 20. But once the specification became emergent, structural integration took a hit, and Claude Code struggled more, showing a larger drop in semantic faithfulness than Codex.
ProjectGuard: The Saviour?
Here's where it gets interesting. Enter ProjectGuard. This tool acts like an external layer keeping the project's state in check, aiming to bridge the faithfulness gap. On Claude Code, ProjectGuard worked wonders. It closed 90% of the faithfulness gap, bumped fully faithful components from 118 to 181, and cut severe failures from 72 to 49.
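The article doesn't detail how ProjectGuard works internally, so the following is only an illustrative sketch of what an "external layer keeping the project's state in check" could look like: a running spec ledger that re-injects every earlier requirement into each new coding request. The SpecLedger class and its methods are hypothetical, not ProjectGuard's actual API.

```python
class SpecLedger:
    """Accumulates the project's requirements so earlier specs can't silently drop."""

    def __init__(self):
        self.requirements = []  # everything the user has asked for so far

    def build_prompt(self, new_request: str) -> str:
        """Record the new request and prepend the full spec history to it."""
        self.requirements.append(new_request)
        spec = "\n".join(f"- {r}" for r in self.requirements)
        return f"Project spec so far:\n{spec}\n\nCurrent request:\n{new_request}"

ledger = SpecLedger()
print(ledger.build_prompt("Add the contrastive loss from Section 3."))
print(ledger.build_prompt("Now log per-epoch evaluation metrics."))
```

Whatever ProjectGuard does in practice, the general idea is the same: keep an authoritative record of the evolving specification outside the model's context, rather than trusting the agent to remember it.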
Why does this matter? Because it highlights a critical weakness in coding AIs: they don't track project specifications well over time. This isn't just a tech problem; it's a usability issue. Developers need tools that can keep up with the real-world ebb and flow of projects. Are these AIs up to the task without a helper like ProjectGuard?
The Takeaway
What's the one thing to remember from this week? Specification tracking is a must-have skill for coding agents, not just a nice-to-have. SLUMP shows us that even the best AI platforms falter when they're fed tasks piece by piece. ProjectGuard's success indicates a path forward, but it also reveals a glaring gap in current coding AI capabilities.
That's the week. See you Monday.