FinTrace: The New Benchmark Shaking Up Financial AI

JUST IN: Financial AI has a new benchmark, and it's called FinTrace. It changes how we evaluate large language models (LLMs) on financial tasks. Forget old-school call-level metrics that grade one tool call at a time. FinTrace scores the whole trajectory, judging a model's reasoning from first step to final answer across different financial scenarios.

Why FinTrace Matters

Here's the scoop: FinTrace brings together 800 expert-annotated trajectories covering 34 real-world financial tasks, all with varied difficulty levels. This isn't just about testing LLMs on basic tasks. It's about seeing how they handle complex, long-horizon financial scenarios. And the results? Eye-opening, to say the least.

FinTrace evaluates LLMs using a rubric-based protocol with nine metrics spread over four axes: action correctness, execution efficiency, process quality, and output quality. It's a comprehensive approach that paints a clearer picture of how well these models really perform.
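To make the four-axis idea concrete, here is a minimal sketch of how per-metric rubric scores could roll up into per-axis scores. The axis names come from the benchmark description above. Everything else is an assumption for illustration: the nine metric names are placeholders, the 0-to-1 scale is invented, and plain averaging is just one plausible way to aggregate, not necessarily what FinTrace does.

```python
from statistics import mean

# Axis names come from the article. The nine metric names are placeholders
# invented for this sketch, not FinTrace's actual metric names.
RUBRIC = {
    "action_correctness": ["tool_selection", "argument_validity"],
    "execution_efficiency": ["step_economy", "redundancy_avoidance"],
    "process_quality": ["logical_progression", "information_use", "error_recovery"],
    "output_quality": ["answer_accuracy", "answer_completeness"],
}


def axis_scores(metric_scores):
    """Average per-metric scores (assumed to lie in [0, 1]) within each axis."""
    missing = [m for ms in RUBRIC.values() for m in ms if m not in metric_scores]
    if missing:
        raise KeyError(f"missing metric scores: {missing}")
    return {axis: mean(metric_scores[m] for m in ms) for axis, ms in RUBRIC.items()}


# Hypothetical trajectory: strong tool choice, weak use of what the tools returned.
example = {
    "tool_selection": 1.0, "argument_validity": 1.0,
    "step_economy": 0.5, "redundancy_avoidance": 0.5,
    "logical_progression": 0.5, "information_use": 0.25, "error_recovery": 0.75,
    "answer_accuracy": 0.25, "answer_completeness": 0.25,
}
scores = axis_scores(example)
for axis, value in scores.items():
    print(f"{axis}: {value:.2f}")
```

Reporting one number per axis, rather than a single blended score, is what lets a reader see a model that picks tools well and still lands a weak final answer.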

The Results: A Mixed Bag

The evaluation of 13 major LLMs reveals a clear pattern. The models excel at selecting the right tools, but there's a massive gap in their ability to actually use the information those tools return. It's a bit like owning a full toolbox and not knowing what to do with the tools inside.

The open question is why end-to-end answer quality isn't improving, even when intermediate reasoning gets better. It's a conundrum that FinTrace has brought to light.

Training for the Future

Enter FinTrace-Training, billed as the first trajectory-level preference dataset for financial tool-calling. It features 8,196 curated trajectories, complete with tool-augmented contexts and preference pairs. By fine-tuning Qwen-3.5-9B on this data, researchers have shown that training on trajectory-level preferences can help, at least in part.

Direct preference optimization (DPO) is proving effective at suppressing failure modes, but final output quality remains the bottleneck. So, is FinTrace the answer to all our problems? Probably not. But it's a step in the right direction, forcing us to confront the limitations of current AI models in the financial sector.
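For readers who want to see what DPO does with a preference pair, here is a minimal sketch of the standard DPO loss for a single (chosen, rejected) trajectory pair. This is the generic objective, not FinTrace-Training's actual training code. The log-probabilities below are made-up numbers, and beta=0.1 is an assumed setting that the article does not specify.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full trajectory under
    either the model being trained (policy) or the frozen reference model.
    beta=0.1 is an assumed value, not one taken from the article.
    """
    # How much more the policy favors the chosen trajectory over the rejected
    # one, measured relative to the reference model.
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) equals log(1 + exp(-x)); this form avoids overflow.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Made-up log-probabilities for illustration only.
neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)        # policy matches reference
prefers_chosen = dpo_loss(-10.0, -14.0, -12.0, -12.0)  # policy leans the right way
prefers_rejected = dpo_loss(-14.0, -10.0, -12.0, -12.0)  # policy leans the wrong way

print(neutral, prefers_chosen, prefers_rejected)
```

The loss falls as the model shifts probability toward the preferred trajectory and away from the rejected one, which is how this kind of training can push down specific failure modes. Nothing in the objective grades the final answer directly, which may be part of why output quality is harder to move.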

The Hot Take

Let's be clear: FinTrace is shaking things up. It's pushing big labs to rethink how they evaluate and improve LLMs. But will it lead to a new era of super-intelligent financial AI? It's too soon to tell. One thing's for sure, though: ignoring these gaps is no longer an option.