How AI Raters Skew Clinical Evaluations: A Closer Look

Artificial intelligence is transforming clinical evaluations, but not always as expected. In adult type 2 diabetes pharmacotherapy checks, large language models (LLMs) have been employed as AI raters to score clinical decisions. Yet the scoring behavior of these models remains underexplored. New research sheds light on this, revealing notable biases and suggesting a need for rubric-based protocols.

The Role of LLMs in Clinical Settings

In a study examining clinical decision support systems (CDSS) across 12-month evaluations for diabetes treatments, four open-source LLMs took center stage. These models acted as both decision support and AI raters. Their outputs were evaluated using two distinct protocols: a rubric-anchored Gold Rubric (GR) protocol and a rubric-free Non-Gold Rubric (Non-GR) protocol.

Here's what the benchmarks actually show: under the Non-GR protocol, AI raters consistently awarded higher scores, averaging between 74 and 78 points. Under the GR protocol, scores plummeted, landing as much as 49.64 points lower. This gap highlights the importance of using patient-specific rubrics to ensure accurate evaluations.
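To make the comparison concrete, here is a minimal sketch of how a protocol gap like this could be computed. The scores below are hypothetical placeholders, not data from the study, and the function name `protocol_gap` is an assumption introduced for illustration.

```python
from statistics import mean

# Hypothetical 0-100 scores for the same set of CDSS outputs.
# These are illustrative placeholders, NOT figures from the study.
non_gr_scores = [76.0, 78.0, 74.0, 77.0]  # rubric-free (Non-GR) protocol
gr_scores = [40.0, 35.0, 52.0, 30.0]      # rubric-anchored (GR) protocol


def protocol_gap(non_gr, gr):
    """Return mean Non-GR score minus mean GR score.

    A positive value means the rubric-free protocol scored the
    same outputs more leniently than the rubric-anchored one.
    """
    if len(non_gr) != len(gr):
        raise ValueError("both protocols must score the same outputs")
    return mean(non_gr) - mean(gr)


gap = protocol_gap(non_gr_scores, gr_scores)
print(f"Non-GR mean: {mean(non_gr_scores):.2f}")
print(f"GR mean:     {mean(gr_scores):.2f}")
print(f"Gap:         {gap:.2f} points")
```

With these made-up inputs the rubric-free scores sit well above the rubric-anchored ones, which is the pattern the study reports; the actual magnitudes depend on the rater model and the cases scored.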

Why Rubrics Matter

Rubrics aren't just bureaucratic checklists. They provide a structured, standardized approach that seems to foster more accurate and discriminative scoring. In fact, the GR protocol amplified disparities between different CDSS outputs by a factor of 1.76 to 5.10. It's clear: the scoring protocol matters as much as the model doing the scoring.
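One way to read an amplification factor is as a ratio of score spreads. The sketch below assumes the spread is the range (highest minus lowest mean score across CDSS outputs); the article does not say how the study measured disparity, so this definition, the system names, and the numbers are all assumptions for illustration.

```python
def spread(scores_by_cdss):
    """Range of mean scores across CDSS outputs (max minus min)."""
    values = list(scores_by_cdss.values())
    return max(values) - min(values)


def amplification(gr, non_gr):
    """Ratio of the GR spread to the Non-GR spread.

    Values above 1 mean the rubric-anchored protocol separates the
    CDSS outputs more sharply than the rubric-free protocol.
    """
    base = spread(non_gr)
    if base == 0:
        raise ValueError("Non-GR spread is zero; ratio is undefined")
    return spread(gr) / base


# Hypothetical mean scores per CDSS; NOT data from the study.
non_gr_means = {"cdss_a": 78.0, "cdss_b": 76.0, "cdss_c": 74.0}
gr_means = {"cdss_a": 50.0, "cdss_b": 42.0, "cdss_c": 38.0}

factor = amplification(gr_means, non_gr_means)
print(f"Non-GR spread: {spread(non_gr_means):.1f}")
print(f"GR spread:     {spread(gr_means):.1f}")
print(f"Amplification: {factor:.2f}x")
```

Under this reading, a rater that compresses every system into a narrow band of scores has little power to tell a strong CDSS from a weak one, which is why a larger spread under the rubric is treated as a sign of better discrimination.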

Non-GR protocols, on the other hand, suppressed key behavioral variations among rater models. This raises an important question: can we rely on AI to make nuanced, patient-specific decisions without guidance? The numbers suggest not.

The Path Forward

Strip away the marketing and you get a sobering realization: rubric-free scoring is inadequate for complex clinical evaluations. It can't replace protocols that demand jurisdiction-specific criteria, which AI models can't infer simply from data.

The reality is, if we want AI to contribute meaningfully to healthcare, we must insist on scoring methods that enhance rather than diminish accuracy. While technology advances, human oversight remains indispensable. In healthcare, that's non-negotiable.
