Reimagining Multi-Turn Text-to-SQL: A Deep Dive into Memory and Execution
EnterpriseMem-Bench redefines multi-turn Text-to-SQL with innovative evaluation. The study challenges assumptions about memory architectures and execution accuracy.
Multi-turn Text-to-SQL is becoming central to enterprise analytics. EnterpriseMem-Bench is a new benchmark comprising 300 sessions and 1,400 turns drawn from domains such as BIRD financial, SEC EDGAR, and Northwind. It departs from the usual single-turn evaluations by providing deterministic ground truth and detailed memory-critical annotations for each turn.
Challenging the Status Quo
The researchers evaluated five models: GPT-5 mini, GPT-5.2, Claude Sonnet 4.5, Sonnet 4.6, and Opus 4.6. Each was tested under five distinct memory conditions, with a three-way ablation isolating the effects of working-memory window size, episodic retrieval, and semantic augmentation. According to the study, the Claude models need extended thinking enabled to compete on equal footing with the GPT models' reasoning abilities.
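To make the three ablated components concrete, here is a minimal Python sketch of how a memory condition might assemble the context for a single turn. The names `MemoryCondition`, `window_size`, `use_episodic`, `use_semantic`, and `build_context` are illustrative assumptions, not the benchmark's actual interface, and the retrieval steps are stand-ins that simply accept precomputed snippets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryCondition:
    # Hypothetical knobs mirroring the three ablated components.
    window_size: int = 0        # working memory: prior turns kept verbatim
    use_episodic: bool = False  # episodic retrieval: recalled earlier turns
    use_semantic: bool = False  # semantic augmentation: schema or domain notes


def build_context(history, question, cond, episodic_hits=(), semantic_notes=()):
    """Assemble the text a model would see for one turn (illustrative only)."""
    parts = []
    if cond.use_semantic:
        parts.extend(f"[note] {n}" for n in semantic_notes)
    if cond.use_episodic:
        parts.extend(f"[recalled] {h}" for h in episodic_hits)
    if cond.window_size > 0:
        # Keep only the most recent `window_size` turns.
        parts.extend(f"[prior] {t}" for t in history[-cond.window_size:])
    parts.append(f"[question] {question}")
    return "\n".join(parts)
```

With the default `MemoryCondition()`, the model sees only the current question, which corresponds to the stateless setting; raising `window_size` adds recent turns back in.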
The findings are striking. In the stateless setting, execution accuracy plummets to zero by the third turn across all models. This suggests that a stateless setup is woefully inadequate for multi-turn interactions, and that we're only scratching the surface of what's possible.
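Execution accuracy is usually computed by running the predicted and reference queries and comparing their results. The sketch below shows one way to do that per turn with Python's built-in `sqlite3` module. It is an assumption about the general technique, not the benchmark's own harness: the function names are hypothetical, and rows are compared as unordered multisets, which would need tightening for queries where `ORDER BY` matters.

```python
import sqlite3
from collections import Counter


def run(conn, sql):
    """Return the result rows as a multiset, or None if the query fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_match(conn, predicted_sql, gold_sql):
    """True when both queries run and return the same rows (order ignored)."""
    gold = run(conn, gold_sql)
    pred = run(conn, predicted_sql)
    return gold is not None and pred == gold


def accuracy_by_turn(conn, turns):
    """turns: iterable of (turn_index, predicted_sql, gold_sql) tuples."""
    hits, totals = Counter(), Counter()
    for idx, predicted_sql, gold_sql in turns:
        totals[idx] += 1
        hits[idx] += execution_match(conn, predicted_sql, gold_sql)
    return {idx: hits[idx] / totals[idx] for idx in sorted(totals)}
```

Grouping by turn index is what exposes a collapse like the one described above: a follow-up such as "only for the EU" cannot be translated correctly if the model never saw the turn it refers to.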
Memory Complexity: More Isn't Always Better
An intriguing discovery is that increased memory-architecture complexity doesn't translate linearly into better performance. In fact, working memory dominates, overshadowing the benefits of the additional components. Depending on the model and dataset, those components could either boost accuracy by 14 percentage points or hinder it by 16. That raises the question: are we overcomplicating what should be a straightforward process?
The study also highlights a surprising generational regression. Claude Sonnet 4.6 underperforms its predecessor Sonnet 4.5 by 17 to 33 percentage points on the SEC EDGAR dataset. Even with reasoning enabled, this regression persists, indicating a fundamental issue that needs addressing.
The Role of Reasoning in Error Distribution
Under reasoning conditions, another pattern emerges. Claude models show a mono-modal error distribution: every turn that isn't correct results in a wrong-result error. This isn't just a technical curiosity. It's a call to action for developers and researchers to reconsider how errors are handled and rectified in these systems.
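A small Python sketch can illustrate what a mono-modal error distribution means in practice. The label names `correct`, `wrong_result`, and `execution_error` are assumptions made for illustration (the study is only quoted here as naming the wrong-result category), and the functions are hypothetical rather than part of the benchmark.

```python
import sqlite3
from collections import Counter


def classify_turn(conn, predicted_sql, gold_sql):
    """Label one turn by executing both queries against the same database."""
    gold_rows = Counter(conn.execute(gold_sql).fetchall())
    try:
        pred_rows = Counter(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return "execution_error"  # the query did not run at all
    # The query ran; it is wrong only if the rows differ (order ignored).
    return "correct" if pred_rows == gold_rows else "wrong_result"


def error_distribution(labels):
    """Count the non-correct labels by category."""
    return dict(Counter(label for label in labels if label != "correct"))


def is_mono_modal(labels):
    """True when every error falls into a single category."""
    return len(error_distribution(labels)) == 1
```

Under this framing, a mono-modal distribution made up entirely of wrong-result errors means the generated SQL always executes but returns the wrong rows, which is harder to catch automatically than a query that fails outright.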
EnterpriseMem-Bench isn't just a new benchmark. It's a challenge to rethink the computational plumbing of multi-turn Text-to-SQL. We must ask ourselves: are our current systems equipped to handle this complexity, or is it time for a more sophisticated approach? The data suggests there's still a long way to go before we achieve true autonomy in this space.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Claude: Anthropic's family of AI assistants, including Claude Haiku, Sonnet, and Opus.
Compute: The processing power needed to train and run AI models.
Evaluation: The process of measuring how well an AI model performs on its intended task.