AI Coding Tools: Are They Broken by Default?
A recent study tested AI coding assistants on security-critical tasks. The results? Over half the generated code had vulnerabilities. What's going wrong?
AI-powered coding assistants have become mainstays in the development toolkit, especially in sectors where security is non-negotiable. But are these tools ready for prime time when it comes to generating secure code? A recent study suggests they may not be.
The Study
In a comprehensive formal verification study dubbed 'Broken by Default,' researchers took a hard look at 3,500 code artifacts produced by seven leading large language models (LLMs). Using 500 carefully chosen security-critical prompts, they tested the generated code for exploitable flaws. The results were eye-opening.
Across the board, 55.8% of these code outputs harbored at least one vulnerability identified by the COBALT analysis pipeline. In fact, 1,055 of these vulnerabilities were conclusively proven using the Z3 SMT solver. Now, you'd expect that the top models might fare better, right? Not quite. GPT-4o led the pack, but not in a good way: 62.4% of its outputs were flawed. Meanwhile, Gemini 2.5 Flash was the 'best' performer, yet it still earned a grade D, with 48.4% of its artifacts vulnerable.
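The study's artifacts aren't reproduced here, but the failure mode is a familiar one. As a minimal, hypothetical sketch of the kind of flaw such a pipeline flags (all names and the payload are illustrative, using Python's stdlib `sqlite3`): a query built by string interpolation versus its parameterized equivalent.

```python
import sqlite3

def find_user_vulnerable(cur, name):
    # String interpolation puts untrusted input directly into SQL --
    # the classic injection pattern often seen in generated code.
    return cur.execute(f"SELECT id FROM users WHERE name = '{name}'").fetchall()

def find_user_safe(cur, name):
    # Parameterized query: the driver treats the value as data, not SQL.
    return cur.execute("SELECT id FROM users WHERE name = ?", (name,)).fetchall()

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

payload = "x' OR '1'='1"                      # textbook injection payload
leaked = find_user_vulnerable(cur, payload)   # matches every row
safe = find_user_safe(cur, payload)           # matches nothing

print(len(leaked), len(safe))  # 2 0
```

The interpolated version turns the payload into `... WHERE name = 'x' OR '1'='1'`, which is true for every row; the parameterized version compares the whole payload as a literal string and returns nothing.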
Why It Matters
So, why should anyone care? Well, think about it. If you're relying on AI to write secure code for critical systems, having over half the results potentially exploitable is a serious red flag. These coding assistants are meant to save time and reduce errors, but they're introducing new risks instead. It's a classic case of the pitch deck saying one thing while the product says another.
Interestingly, the study also revealed that even when models were given explicit security instructions, the vulnerability rate dropped by just 4%. That’s a drop in the bucket. Plus, industry-standard tools missed nearly 98% of the vulnerabilities that were mathematically proven. This raises an important question: are we prioritizing speed and convenience over something as essential as security?
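Why would a solver catch what pattern-matching scanners miss? An SMT solver proves a bug by producing a concrete input that violates a safety property, rather than grepping for suspicious syntax. A toy stand-in for that idea, with no real solver involved: a hypothetical off-by-one bounds check that looks plausible to a pattern matcher, plus a brute-force search that plays the solver's role by hunting for a crashing witness input.

```python
def read_item(buf, index):
    # Hypothetical generated snippet: the bounds check uses `<=` where it
    # should use `<`, so index == len(buf) slips through.
    if 0 <= index <= len(buf):
        return buf[index]
    return None  # rejected by the check

def find_witness(buf, candidates):
    """Brute-force stand-in for a solver: search for a concrete input that
    crashes even though it passes the bounds check."""
    for i in candidates:
        try:
            read_item(buf, i)
        except IndexError:
            return i  # concrete counterexample, like a solver's model
    return None

witness = find_witness([10, 20, 30], range(-2, 6))
print(witness)  # 3
```

A real SMT encoding would find `index == len(buf)` symbolically instead of by enumeration, but the output is the same in spirit: a mathematically checkable proof that the vulnerable path is reachable.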
The Bigger Picture
What's the takeaway here? AI coding tools can be a helpful aid, but they’re far from infallible. The founder story might be compelling, but the metrics are more interesting. If over half the generated code is flawed, there's a fundamental issue that needs addressing. Given that these models can identify their own vulnerable outputs 78.7% of the time in review mode, but still generate them 55.8% of the time, there's a significant gap between capability and execution.
So, where does that leave developers? In the trenches, still tasked with the due diligence of code review and testing. AI might not be ready to take the reins completely, but it can certainly be a solid co-pilot when used wisely. Perhaps the real story here is a call to arms for better integration of AI with existing security protocols. Until then, the grind continues.