FinWorkBench: Testing AI's Finance Workflow Mettle
New benchmark FinWorkBench challenges AI with the intricate world of finance workflows. But can these agents handle real-world messiness?
AI agents might ace games and quizzes, but in the complex terrain of finance workflows, the story's different. Enter FinWorkBench, or Finch, a new benchmark designed to push AI through the gauntlet of real-world finance and accounting tasks.
The Challenge
FinWorkBench isn’t your average test. It’s built from a whopping 15,000 files and 500,000 emails from real workplaces, like Enron and other finance giants, covering data from 2000 to 2025. This isn’t some neat lab environment. It’s packed with the chaotic mix of tables, charts, and documents that make up real finance work.
What's in a Workflow?
Finch's workflow construction is no small feat. It involves extracting workflows from real email threads and spreadsheet histories. Imagine sifting through 1,710 spreadsheets with 27 million cells and annotating them, a task that took over 700 hours of expert effort. That’s dedication. But why does it matter? Simple. These workflows capture the intense, collaborative, and often messy nature of enterprise work. They’re long, knowledge-heavy, and demand more than just simple calculations.
AI Performance Under Scrutiny
FinWorkBench puts AI systems like GPT 5.1, Claude Sonnet/Opus 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max under the microscope. The results? Let’s just say the AI still has a long way to go. For instance, GPT 5.1 Pro takes an average of 16.8 minutes per workflow but only nails 38.4% of them. That’s not exactly confidence-inspiring, is it?
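To put those two GPT 5.1 Pro numbers side by side, here is a rough back-of-the-envelope sketch in Python. The derived metric, agent minutes spent per passed workflow, is our own illustration rather than something the benchmark reports. It assumes the 16.8-minute average applies to every attempted workflow, passed or failed.

```python
# Figures quoted above for GPT 5.1 Pro on FinWorkBench.
minutes_per_workflow = 16.8   # average time per attempted workflow
pass_rate = 0.384             # share of workflows completed correctly

# Illustrative metric (an assumption of this sketch, not a benchmark result):
# total agent time divided by the number of passed workflows.
# For N attempts: (16.8 * N) / (0.384 * N) = 16.8 / 0.384.
minutes_per_success = minutes_per_workflow / pass_rate

print(f"{minutes_per_success:.2f} agent minutes per passed workflow")
# prints "43.75 agent minutes per passed workflow"
```

In other words, under that assumption, each correct result costs roughly 44 minutes of agent time once the failed attempts are counted.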
In practice, these AI tools face real challenges when dealing with workflows that look different from their training data. The demo may be impressive, but the deployment story is messier. If these agents are to be truly useful, they need to tackle these edge cases better.
Why Should We Care?
So, why’s this important for you? If you’re in finance, AI that can handle workflows like those in Finch can make your life easier, or at least that’s the promise. The real test is always the edge cases. Can AI handle the unexpected twists of real-world data? That’s the big question, and right now, the results are mixed.
For AI developers and businesses, FinWorkBench is a wake-up call. It’s a reminder to focus not just on building flashy demos but on solving the nitty-gritty challenges of real-world deployment, because in production the work looks far messier than it does in a demo.