Instructed Code Editing Benchmarks: Bridging the Gap to Real-World Deployment
Current benchmarks for instructed code editing fall short of capturing real-world usage: they skew heavily towards Python and lack task diversity. With benchmarks lagging this far behind industry needs, it's time for a revamp that truly reflects coding assistant capability.
In the evolving landscape of AI-driven coding assistants, instructed code editing stands out as a significant interaction, accounting for an impressive 19% of real-world usage. Yet the benchmarks designed to evaluate this capability miss the mark in several key areas. Of the more than 150 code-related benchmarks surveyed, only two, CanItEdit and EDIT-Bench, focus specifically on instructed code editing with human-authored instructions and test-based evaluation.
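To make that setup concrete, here is a minimal sketch of what a test-based instructed-editing problem generally looks like. The field names and the example problem are illustrative assumptions, not the actual CanItEdit or EDIT-Bench schema.

```python
# Illustrative shape of a test-based instructed-editing problem.
# Field names and the example are hypothetical, not the real
# CanItEdit or EDIT-Bench schema.
from dataclasses import dataclass

@dataclass
class EditProblem:
    instruction: str   # human-authored edit request
    before_code: str   # file the model must modify
    tests: str         # hidden tests run against the edited file

example = EditProblem(
    instruction="Make mean() return 0.0 for an empty list instead of raising.",
    before_code=(
        "def mean(xs):\n"
        "    return sum(xs) / len(xs)\n"
    ),
    tests=(
        "def test_empty():\n"
        "    assert mean([]) == 0.0\n"
        "\n"
        "def test_basic():\n"
        "    assert mean([1, 2, 3]) == 2.0\n"
    ),
)
```

A model is judged by whether the file it produces in response to the instruction passes the hidden tests.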
The Scope of Current Benchmarks
Both CanItEdit and EDIT-Bench have been scrutinized, with their programming languages, edit intents, and application domains compared against real-world distributions observed in Copilot Arena, AIDev, and GitHub Octoverse. The most glaring oversight? Over 90% of evaluation is concentrated on Python, completely neglecting TypeScript, GitHub's most-used language. How can these benchmarks reflect industry needs when they ignore such a widely deployed language?
The benchmarks overlook significant areas like backend and frontend development, which account for a staggering 46% of real-world editing tasks. Even more perplexing is the complete absence of documentation, testing, and maintenance edits, which represent 31.4% of human pull requests. Aren't these the very tasks we should be focusing on to gauge the true utility of coding assistants?
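A back-of-the-envelope way to read those figures: treat the real-world shares quoted above as the target distribution and the benchmarks' coverage of those domains as zero, per the audit. The snippet below is just that arithmetic; the domain labels are simplified placeholders, not the audit's actual taxonomy.

```python
# Rough coverage-gap arithmetic using the figures quoted above.
# Domain labels are simplified placeholders, and the zero benchmark
# shares reflect the reported absence of coverage.
real_world_share = {
    "backend/frontend development": 0.46,   # share of real-world editing tasks
    "docs/testing/maintenance": 0.314,      # share of human pull requests
}
benchmark_share = {
    "backend/frontend development": 0.0,
    "docs/testing/maintenance": 0.0,
}

for domain, share in real_world_share.items():
    gap = share - benchmark_share.get(domain, 0.0)
    print(f"{domain}: {gap:.1%} of real-world work with no benchmark coverage")
```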
Examining the Numbers
Looking at the numbers, CanItEdit boasts a median of 13 tests per problem, while EDIT-Bench lags behind with a median of just 4. Beyond raw counts, CanItEdit pairs comprehensive whole-file coverage with a validation process that ensures edits are both necessary and sufficient. In contrast, 59% of EDIT-Bench's tests fail to detect modifications beyond the edit region, raising questions about their efficacy.
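What "necessary and sufficient" means in practice, and why sensitivity to out-of-region changes matters, is easier to see in code. The sketch below is a generic validation harness under assumed conventions (pytest, one solution file per problem), not the actual CanItEdit or EDIT-Bench tooling.

```python
# Generic sketch of "necessary and sufficient" test validation, plus a
# check for sensitivity to changes outside the edit region. This is an
# assumed harness, not the actual CanItEdit or EDIT-Bench tooling.
import pathlib
import subprocess
import tempfile

def tests_pass(code: str, tests: str) -> bool:
    """Write the candidate file and its tests to a temp dir and run pytest."""
    with tempfile.TemporaryDirectory() as tmp:
        pathlib.Path(tmp, "solution.py").write_text(code)
        pathlib.Path(tmp, "test_solution.py").write_text(
            "from solution import *\n\n" + tests
        )
        result = subprocess.run(["pytest", "-q", tmp], capture_output=True)
        return result.returncode == 0

def validate_problem(before_code: str, reference_edit: str,
                     out_of_region_variant: str, tests: str) -> dict:
    return {
        # Necessary: the unedited file must not already pass.
        "edit_is_necessary": not tests_pass(before_code, tests),
        # Sufficient: the reference edit must pass.
        "edit_is_sufficient": tests_pass(reference_edit, tests),
        # Sensitivity: a deliberate change outside the edit region
        # should be flagged by at least one failing test.
        "catches_out_of_region_change": not tests_pass(out_of_region_variant, tests),
    }
```

The 59% figure above corresponds to the last check failing: tests that still pass even when the file is changed outside the intended edit region.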
Intriguingly, 15 problems in EDIT-Bench remain unsolved by any of the 40 language models tested, and 11 of those failures trace back to flawed benchmark artifacts rather than inherent model limitations. Additionally, 29% of EDIT-Bench problems and 6% of CanItEdit problems share a codebase with at least one other problem, pointing to limited diversity in the code the benchmarks draw from.
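The overlap figures are straightforward to reproduce if each problem records where its code was drawn from; the sketch below assumes a hypothetical "codebase" field on each problem record.

```python
# Sketch of the codebase-overlap measurement, assuming each problem
# record carries a hypothetical "codebase" identifier for its source.
from collections import Counter

def shared_codebase_rate(problems: list[dict]) -> float:
    """Fraction of problems whose codebase also appears in another problem."""
    counts = Counter(p["codebase"] for p in problems)
    shared = sum(1 for p in problems if counts[p["codebase"]] > 1)
    return shared / len(problems) if problems else 0.0

# Per the audit above, this comes out to roughly 0.29 for EDIT-Bench
# and 0.06 for CanItEdit.
```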
The Need for a New Approach
What does this mean for the future of instructed code editing? The benchmarks, as they stand, measure a narrower construct than deployment decisions require. These benchmarks aren't just misaligned with real-world demands; they're outdated. The industry needs benchmarks that reflect true editing capability, capturing the diversity and complexity of the tasks developers face.
In response, it's proposed that a fresh set of empirically grounded desiderata be established, alongside the release of all audit artifacts, to aid in building benchmarks that genuinely reflect real-world needs. Until that happens, benchmark scores will keep telling us less about editing capability than deployment decisions demand.