Agent Development Kits: A Double-Edged Sword in AI Model Building
The surge in Agent Development Kits (ADKs) offers both promise and complexity. A new study evaluates 51 ADKs, revealing their varying efficiency and API usability.
The flood of Agent Development Kits (ADKs) has been relentless, each promising to simplify the creation of autonomous agents powered by large language models (LLMs). However, the question of how these kits stack up against one another has largely gone unanswered, until now.
The LLM-as-a-Developer Framework
Enter 'LLM-as-a-Developer,' a methodology that swaps out human developers for LLM coding agents. These agents learn each framework's API from its documentation, write code against it, and iterate until the tests pass. By keeping the developer constant and changing only the framework, the method provides a quantitative look at API usability and framework effectiveness.
This isn't just a theoretical exercise. The researchers have implemented it in what they're calling the 'ADK Arena,' a fully automated system that isolates each framework using Docker. It features a three-level validation pipeline and benchmark adapters for established tests like SWE-bench and MCP-Atlas.
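To make the "same developer, different framework" idea concrete, here is a minimal Python sketch of a generate-and-test loop. It is an illustration only, not the ADK Arena's code: the names `develop_agent`, `compare`, `stub_generate`, and `stub_run_tests`, the retry budget, and the toy frameworks `easy_adk` and `hard_adk` are all hypothetical, and the stubs stand in for a real LLM coding agent and a real, Docker-isolated test run.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Attempt:
    framework: str
    passed: bool
    iterations: int


def develop_agent(framework: str,
                  generate: Callable[[str, str], str],
                  run_tests: Callable[[str], str],
                  max_iterations: int = 5) -> Attempt:
    """Have one fixed developer build an agent on `framework`, retrying on failure.

    `generate(framework, feedback)` returns candidate code.
    `run_tests(code)` returns "" on success, or an error message to feed back.
    """
    feedback = ""
    for i in range(1, max_iterations + 1):
        code = generate(framework, feedback)
        feedback = run_tests(code)
        if not feedback:  # empty feedback means the tests passed
            return Attempt(framework, True, i)
    return Attempt(framework, False, max_iterations)


def compare(frameworks: List[str], generate, run_tests) -> List[Attempt]:
    # The developer (generate) and the tests stay fixed; only the framework varies.
    return [develop_agent(f, generate, run_tests) for f in frameworks]


# Hypothetical stand-ins for the LLM developer and the test harness.
def stub_generate(framework: str, feedback: str) -> str:
    # Toy behaviour: "easy_adk" works first time, others only after feedback.
    if framework == "easy_adk" or feedback:
        return f"{framework}:ok"
    return f"{framework}:broken"


def stub_run_tests(code: str) -> str:
    return "" if code.endswith(":ok") else "tests failed"


results = compare(["easy_adk", "hard_adk"], stub_generate, stub_run_tests)
for r in results:
    print(r.framework, r.passed, r.iterations)
```

Because the developer and the tests are identical across runs, any difference in pass rate or iteration count can be attributed to the framework, which is the comparison the methodology is built around.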
The Findings: Complexity and Costs
The study evaluated 51 popular Python ADK frameworks, creating a staggering 204 agent-benchmark pairs. The results? Success was achieved in 57% of runs, with generation costs swinging wildly from $0.6 to $3.4 per agent. API complexity is a major driver of that cost, but interestingly, cost alone doesn't predict success. In other words, spending more to generate an agent doesn't buy one that works.
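A quick back-of-the-envelope check in Python puts those figures in proportion. Two assumptions are baked in and are not stated in the article: that every framework was paired with the same number of benchmarks, and that the 57% success rate applies to the 204 pairs. The derived numbers are therefore rough inferences, not reported results.

```python
frameworks = 51
pairs = 204
success_rate = 0.57
cost_low, cost_high = 0.6, 3.4  # dollars per generated agent

# Assumes an even split of pairs across frameworks.
benchmarks_per_framework = pairs / frameworks

# Assumes the 57% figure is measured over the 204 pairs.
approx_successes = round(success_rate * pairs)

# Ratio between the most and least expensive agent to generate.
cost_spread = cost_high / cost_low

print(benchmarks_per_framework, approx_successes, round(cost_spread, 1))
```

Under those assumptions, each framework faced four benchmarks, roughly 116 of the 204 pairs succeeded, and the most expensive agent cost between five and six times as much to generate as the cheapest.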
No framework was a runaway winner. The most successful agents solved up to 80% of tasks on individual benchmarks and even outperformed general-purpose coding agents, all at a fraction of the cost. Yet the median resolution rate across frameworks was a mere 32%.
Documentation: A Substitute, Not a Solution
Perhaps the most eye-opening finding was on the substitutability of information sources. Whether the agents were given raw source access or no reference material at all, success rates hovered between 28% and 40%. The takeaway? Documentation, source code, and parametric knowledge are more interchangeable than you'd think. This undercuts the notion that more documentation always translates to better outcomes.
So, where does this leave us? The intersection of ADKs and LLMs is undeniably real, but the vast majority of these projects aren't ready for prime time. The complexity and cost variations indicate that the industry has a long way to go before these tools offer the easy integration that developers crave.
And let's not forget: show me the inference costs. Then we'll talk.