Decoding Large Language Models in Clinical Text Extraction

Large language models (LLMs) are reshaping how we extract information from clinical notes. But how does the choice of model or prompt impact the results? That's what a recent study aims to find out. By focusing on extracting specific clinical information and varying one configuration at a time, researchers have started to unravel this complexity.

The Experiment

The researchers used 17 clinical flags and a 47-tag vocabulary to extract information from MIMIC-IV v3.1 discharge summaries. They tested three prompt variants across two model sizes to see how these factors influenced the results. Using Cohen's kappa, a chance-corrected measure of agreement between two sets of labels, they aimed to pin down where inconsistencies arise.
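To make the metric concrete, here is a minimal sketch of Cohen's kappa for a single flag, comparing the labels produced by two configurations over the same notes. The function name `cohens_kappa` and the toy labels are assumptions made for illustration; they are not taken from the study, and the study's own tooling is not described in the source.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)

    # Observed agreement: share of notes where both configurations give the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement, from each configuration's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Both configurations always emit the same single label.
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Toy labels for one flag across six notes (invented, not study data).
config_a = ["yes", "no", "not_documented", "yes", "no", "not_documented"]
config_b = ["yes", "not_documented", "not_documented", "yes", "no", "no"]

kappa = cohens_kappa(config_a, config_b)
print(round(kappa, 3))
```

In this toy case the two configurations agree on four of six notes, and chance alone would give one in three, so kappa comes out at 0.5. A kappa of 1 means perfect agreement and 0 means agreement no better than chance.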

The findings? For the yes/no/not_documented flags, the models showed similar agreement levels, with median kappa scores hovering around 0.69 and 0.68. The bigger model did better on some fields and worse on others. So a larger model doesn't simply raise agreement across the board; it redistributes where the agreements and disagreements fall. Intriguing, right?

Where It Gets Practical

When they simplified the schema to binary options, most disagreements melted away. It turns out that most of the noise came from distinguishing between absence and silence (a "no" versus a "not_documented"), rather than from confirming a finding's presence. For multi-class categorization, changing the model shifted the dominant tag in about half the notes. Prompt phrasing mattered less, affecting only one in eight notes. Interestingly, the larger model leaned less on catch-all categories, with their use dropping from 44% to 26%.
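The effect of collapsing the schema can be sketched in a few lines. The example below assumes the binary version merges "no" and "not_documented" into a single "not_yes" label; the source does not spell out the exact mapping, so treat `to_binary`, `agreement_kappa`, and the toy labels as hypothetical.

```python
from collections import Counter

def agreement_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

def to_binary(label):
    # Assumed mapping: anything other than an explicit "yes" becomes "not_yes".
    return "yes" if label == "yes" else "not_yes"

# Toy labels for one flag across six notes (invented, not study data).
# The two configurations only ever disagree on "no" versus "not_documented".
config_a = ["yes", "no", "not_documented", "yes", "no", "not_documented"]
config_b = ["yes", "not_documented", "not_documented", "yes", "no", "no"]

three_way = agreement_kappa(config_a, config_b)
binary = agreement_kappa(
    [to_binary(x) for x in config_a],
    [to_binary(x) for x in config_b],
)
print(round(three_way, 3), round(binary, 3))
```

Because every disagreement in the toy data is between absence and silence, the three-way kappa of 0.5 rises to 1.0 once those two labels are merged. Real notes will be less tidy, but the mechanism is the same one the study points to.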

Here's the catch: the schema's complexity might be creating more divergence than the models themselves. In practice, this means that if you're looking to deploy these models in a real-world setting, simplifying your schema might do wonders. But don't get complacent. The real test is always the edge cases.

Why It Matters

Why should we care? Well, in healthcare, accuracy can make or break a diagnosis. Understanding how LLMs behave isn't just academic. It's critical for anyone planning to deploy these systems at scale. The demo is impressive, but the deployment story is messier. If the choice of model can sway results this much, we've got to rethink how much we trust automated extractions. Are we ready to let AI make these calls, or should we keep a closer human eye on them?