The Challenges of AI Upgrades: When Stability Turns to Chaos

AI systems have become indispensable in today's business landscape, offering the promise of efficiency and automation. But what happens when a straightforward upgrade unleashes unexpected chaos?

The System That Worked, Until It Didn't

It all started with a system designed to turn natural-language questions into API calls. Analysts and account managers relied on it to simplify data requests that had previously required navigating multiple dashboards and tools. By early 2025, the system was hitting its stride, generating hundreds of reports each month for leadership and external stakeholders alike.

The system hinged on a structured JSON object that acted as a contract between the language model and the rest of the system. The underlying model, Claude Sonnet, moved from version 3.5 to 4.0 without incident. Then came the upgrade to 4.5, and with it, the unforeseen complications.

Unbounded Complexity: The Infinite Blast Radius

With 4.5, the system began misbehaving. The model started inserting unexpected content into the JSON description field, and key parameters went missing from the object. The resulting API calls were incorrect or incomplete: some returned all-time sales data instead of the requested slice, and others triggered errors. Even more perplexing, the model began asking clarifying questions, a behavior the system had no way to handle.
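One way to contain failures like these is to check the model's output against the contract before any API call is issued, and to refuse the call when the check fails. The sketch below is only an illustration under assumptions: the article does not publish the real schema, so the field names (`endpoint`, `params`, `description`) and the required parameters (`start_date`, `end_date`) are hypothetical.

```python
import json
from dataclasses import dataclass

# Hypothetical contract: the real schema is not given in the article.
REQUIRED_TOP_LEVEL = {"endpoint", "params", "description"}
REQUIRED_PARAMS = {"start_date", "end_date"}


@dataclass
class Verdict:
    ok: bool
    reason: str


def check_model_output(raw: str) -> Verdict:
    """Classify raw model output before any API call is issued."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        # Free text, such as a clarifying question, lands here.
        return Verdict(False, "not_json")
    if not isinstance(payload, dict):
        return Verdict(False, "not_object")
    missing = REQUIRED_TOP_LEVEL - payload.keys()
    if missing:
        return Verdict(False, "missing_fields:" + ",".join(sorted(missing)))
    params = payload["params"]
    if not isinstance(params, dict):
        return Verdict(False, "params_not_object")
    missing_params = REQUIRED_PARAMS - params.keys()
    if missing_params:
        # Fail closed: a call without a date range could return all-time data.
        return Verdict(False, "missing_params:" + ",".join(sorted(missing_params)))
    return Verdict(True, "ok")
```

A guard of this kind would not stop the model from changing its behavior, but it would turn a silently wrong report into an explicit, loggable rejection.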

This isn't just a story of a failed upgrade. It highlights a fundamental issue with AI systems: the infinite blast radius. In traditional software engineering, a change usually has a bounded, predictable scope. A model upgrade does not. Because the system relied on a black-box model, even a minor version bump could change behavior anywhere the model's output was used and cause widespread disruption.

Lessons Learned and the Path Forward

The post-mortem revealed a key lesson: assumptions about model behavior can be dangerous. The problem wasn't the model itself but the team's reliance on it to fill in gaps that had never been specified. The solution? Treat the evaluation suite as the formal specification, and require every change to pass it before it ships. In essence, an evals-first approach becomes the new discipline for AI development.
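To make the evals-first idea concrete, here is a minimal sketch of an eval suite used as an upgrade gate. Everything in it is an assumption made for illustration: the cases, the `params` field, and the helper names `run_evals` and `safe_to_upgrade` are hypothetical, and `generate` stands in for whatever function sends a question to the model and returns its raw text output.

```python
import json
from typing import Callable

# Hypothetical eval cases: each pairs a question with the parameters the
# generated call must contain. A real suite would be far larger.
EVAL_CASES = [
    {"question": "Sales for Q1 2025",
     "expect": {"start_date": "2025-01-01", "end_date": "2025-03-31"}},
    {"question": "Sales for January 2025",
     "expect": {"start_date": "2025-01-01", "end_date": "2025-01-31"}},
]


def case_passes(raw: str, expect: dict) -> bool:
    """Return True if the raw output is a JSON object with the expected params."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False  # e.g. the model asked a clarifying question
    if not isinstance(payload, dict):
        return False
    params = payload.get("params")
    if not isinstance(params, dict):
        return False
    return all(params.get(key) == value for key, value in expect.items())


def run_evals(generate: Callable[[str], str], cases=EVAL_CASES) -> float:
    """Return the fraction of eval cases the given model function passes."""
    passed = sum(case_passes(generate(c["question"]), c["expect"]) for c in cases)
    return passed / len(cases)


def safe_to_upgrade(candidate_score: float, baseline_score: float) -> bool:
    """Gate an upgrade: the candidate must not regress against the baseline."""
    return candidate_score >= baseline_score
```

In this sketch, the current model and the candidate model would each be run through `run_evals`, and the upgrade would proceed only if `safe_to_upgrade` returns true for the two scores.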

But here's the kicker: evals are costly and imperfect. They catch only the issues someone already thought to test for, and they can't predict unforeseen problems. As AI systems evolve, engineers must grapple with new metrics of success. Is it time to rethink how we trust and deploy AI? The compliance layer is where most of these platforms will live or die.

The AI community must invest in developing strong standards for evals to close the gap between testing and production readiness. Only then can we ensure stability in a world where AI decisions increasingly mirror human choices.