Are AI Models Really Immune to Prompt Injection Attacks?

Recent chatter in the AI community has centered on claims that new computer-using-agent (CUA) models are largely immune to prompt-injection attacks. Reports tout attack success rates of between 42% and 98% against older models. But, frankly, these numbers obscure more than they reveal. They focus on outdated models or cherry-pick the most vulnerable ones to make the attacks look effective.

Unpacking the Numbers

Let's apply some rigor here. The latest research benchmark, CUA-HandCrafted, tested 793 episodes across 24 web tasks using 56 attack templates. The study targeted Claude Sonnet 4.6 and GPT-5.4 to gauge how well they fend off such attacks. The result? Zero out of 140 attack attempts succeeded. The Clopper-Pearson 95% upper bound on the attack success rate was just 2.60%. That sounds impressive, but does it tell the full story?
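
For readers who want to sanity-check the 2.60% figure, here is a minimal sketch. It assumes the bound was computed as a two-sided 95% Clopper-Pearson interval on 0 successes out of 140 trials; the function name is illustrative and does not come from the study. With zero successes, the exact bound reduces to a closed form, so no statistics library is needed.

```python
def clopper_pearson_upper_zero(n: int, confidence: float = 0.95) -> float:
    """Two-sided Clopper-Pearson upper bound when 0 of n trials succeed.

    Hypothetical helper for illustration only. It covers just the
    zero-success case, where the exact bound has a closed form.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    # With zero successes, the upper bound p solves (1 - p)^n = alpha / 2.
    return 1.0 - (alpha / 2.0) ** (1.0 / n)


upper = clopper_pearson_upper_zero(140)
print(f"{upper:.2%}")  # prints 2.60%
```

Under that assumption the arithmetic reproduces the reported number. Note what the bound does and does not say: it limits the success rate of the specific hand-crafted attacks that were tested, not of attacks in general.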

What they're not telling you: these models don't exhibit this resilience across all scenarios. A sister benchmark, SkillBench, revealed a glaring vulnerability. Using the same model weights, up to 100% of the hand-crafted skill-injection attacks were successful. Color me skeptical, but it seems the perceived robustness might not be as universal as the headlines suggest.

The Real Threat: RL-Optimized Strings

The high success rates touted in the literature may owe more to reinforcement learning (RL)-optimized injection text than to the intrinsic strength of the attack categories. This points to a fundamental flaw in the current reporting methodologies. Without releasing the optimized attack strings, claims of security are, at best, unverifiable, and at worst, misleading.

It's one thing to claim safety in a controlled browser domain. But extrapolating these results to other domains without concrete evidence risks overfitting the narrative to a specific context. Where is the broader evaluation across varied scenarios?

Why This Matters

In an era where AI models increasingly take the helm in decision-making, the assurance of their security must be more than just smoke and mirrors. The industry can't afford to rest easy on skewed results or limited testing scenarios. One misstep could spell catastrophe in real-world applications.

The question isn't whether these models can resist attacks but how adaptable they are to evolving threats. With the AI arms race accelerating, the need for transparent and reproducible testing has never been more urgent. Can we really trust models that thrive in a laboratory but falter in the wild?