BlueFin: A Benchmark Holding AI to Account in Finance

In professional finance, spreadsheets are ubiquitous, yet the ability of large language models (LLMs) to handle complex spreadsheet tasks remains under-explored. Enter BlueFin, a new benchmark designed to put LLMs through their paces with spreadsheet tasks that mirror real occupational demands.

The BlueFin Benchmark

BlueFin introduces 131 challenging tasks, curated to reflect the complexity and real-world relevance of the work finance professionals face. Together the tasks comprise 3,225 granular criteria and are validated by expert human annotators. Each task is evaluated by a language model (LM) judge, which achieves a parity score of 0.826 with expert consensus and a macro-F1 score of 0.839. The benchmark sets a high bar for LLMs and reveals their struggles, particularly in dynamic correctness.
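For readers unfamiliar with the macro-F1 figure quoted above, the following is a minimal sketch of how such a score can be computed when an LM judge and an expert panel each give a pass/fail verdict per criterion. The function name `macro_f1`, the boolean verdict encoding, and the toy data are assumptions made for illustration; this is not BlueFin's published evaluation code, and it does not attempt to reproduce the parity score.

```python
def macro_f1(expert, judge, labels=(True, False)):
    """Unweighted mean of per-class F1 between two verdict lists.

    `expert` and `judge` are equal-length lists of pass/fail verdicts,
    one entry per criterion (a hypothetical encoding, not BlueFin's own).
    """
    if len(expert) != len(judge):
        raise ValueError("verdict lists must have the same length")

    f1_scores = []
    for label in labels:
        # Treat `label` as the positive class and count outcomes.
        tp = sum(1 for e, j in zip(expert, judge) if e == label and j == label)
        fp = sum(1 for e, j in zip(expert, judge) if e != label and j == label)
        fn = sum(1 for e, j in zip(expert, judge) if e == label and j != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)

    # Macro averaging weights each class equally, regardless of frequency.
    return sum(f1_scores) / len(f1_scores)


# Toy verdicts for eight criteria (made-up data for illustration only).
expert_verdicts = [True, True, True, False, False, True, False, True]
judge_verdicts = [True, True, False, False, True, True, False, True]

print(round(macro_f1(expert_verdicts, judge_verdicts), 3))
```

Because macro averaging gives the "pass" and "fail" classes equal weight, a judge cannot score well simply by agreeing with experts on whichever verdict is more common.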

LLMs Under Pressure

Despite the widespread use of spreadsheet software by hundreds of millions globally, AI’s performance in this domain is lagging. Frontier LLMs, including some of the strongest available, score below 50% on average across BlueFin's tasks. This exposes a significant gap between AI's current capabilities and the nuanced demands of real-world finance tasks.

Why It Matters

Why should the finance sector care about an AI benchmark? For starters, it highlights the need for more resources and development focus on LLMs tailored to the finance industry. With so many people relying on spreadsheets for critical tasks, AI that can handle them efficiently isn't a luxury but a necessity. The implications extend beyond efficiency to accuracy and reliability, core tenets of finance.

The challenge is clear: can AI catch up to the demands of professional finance, or will humans continue to outperform it on these complex tasks?

The Road Ahead

BlueFin's contributions include not just the dataset but also an open-source framework for agentic evaluation. This initiative is a call to action for AI developers. It's time to advance AI's ability to tackle spreadsheet tasks so that AI spreadsheet tools become more than novelty applications. The finance sector stands to benefit greatly from improved LLM capabilities, but that starts with recognizing the gap and putting in the work to bridge it.