Why Health AI Benchmarks Miss the Real Clinical Mark
Health AI benchmarks aren't reflecting reality. They're missing the complexity of true clinical practice, leaving out key data and vulnerable groups.
Let's talk about the gap between health AI benchmarks and the actual needs in clinical settings. It's a problem that shouldn't be ignored, and it's bigger than many realize. These benchmarks, designed to validate large language models (LLMs) in healthcare, often fall short of capturing the intricate realities of clinical practice. They're missing key data and vulnerable populations, and that's a real issue if these models are ever to be truly useful in clinical scenarios.
Missing Pieces
In a recent analysis of 18,707 consumer health queries spread across six public benchmarks, a glaring misalignment was discovered. These benchmarks have indeed evolved from mere static retrieval systems to more interactive dialogue formats. Yet they still fall short of the depth and breadth required for real-world clinical applications. The data shows that only 42% of the queries referenced objective data, and that slice was heavily skewed toward wellness-focused signals like wearables, which accounted for 17.7%. More complex diagnostic inputs such as lab values and imaging are mere blips at 5.2% and 3.8%, respectively. And perhaps more concerning, raw medical records make up less than 1%.
Who Gets Left Behind?
Vulnerable populations are notably missing. Pediatrics and older adults, who often need specialized care, represent less than 11% of the queries. And let's not forget about global health needs, which are sorely underrepresented. If health AI is to be a global tool, it must embrace global diversity, not just echo the datasets of the developed world.
Safety-critical scenarios are another spot where these benchmarks stumble. Suicide and self-harm entries are less than 0.7% of the data. Chronic disease management, a vital part of ongoing healthcare, is featured in only 5.5% of the cases. If AI is to be trusted with health, shouldn't it be equipped to handle these serious concerns?
The Path Forward
Here's the thing: if these benchmarks don't adapt, they risk becoming irrelevant. The field needs to adopt a standardized query profiling system, much like the baseline-characteristics reporting that's standard in clinical trials. Without this, they're leaving a lot of essential scenarios off the table. We're not just talking about a better AI model; we're talking about AI that's genuinely usable in the complexities of real healthcare settings. If AI can't handle the full spectrum of clinical practice, is it really ready at all?
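To make the idea concrete, here is a minimal sketch of what such query profiling might look like in practice: tag each benchmark query along a few clinical axes, then report the share of each category, the way a trial reports baseline characteristics. The axis names, category labels, and toy data below are illustrative assumptions, not taken from the study.

```python
# Hypothetical sketch of a benchmark query profiler.
# Axes and category labels are illustrative assumptions, not the study's taxonomy.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Query:
    text: str
    data_type: str   # e.g. "wearable", "lab_value", "imaging", "none"
    population: str  # e.g. "adult", "pediatric", "older_adult"
    topic: str       # e.g. "wellness", "chronic_disease", "self_harm"

def profile(queries: list[Query]) -> dict[str, dict[str, float]]:
    """Return the fraction of queries in each category along each axis."""
    n = len(queries)
    report = {}
    for axis in ("data_type", "population", "topic"):
        counts = Counter(getattr(q, axis) for q in queries)
        report[axis] = {label: count / n for label, count in counts.items()}
    return report

# Toy usage: three hand-labeled queries.
queries = [
    Query("step count advice", "wearable", "adult", "wellness"),
    Query("HbA1c interpretation", "lab_value", "older_adult", "chronic_disease"),
    Query("fever in a toddler", "none", "pediatric", "acute"),
]
report = profile(queries)
print(report["population"])
```

A report like this makes the gaps in the article's numbers immediately visible: a benchmark where `pediatric` or `self_harm` rounds to near zero fails the profile check before a single model is evaluated.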
In healthcare, the urgency is real. We need benchmarks that reflect the true stakes of clinical practice, and infrastructure that actually meets patients' needs, not just those of the well-instrumented, well-represented few.