Do Medical Vision-Language Models Really Know Their Stuff?

Medical vision-language models (VLMs) have come under scrutiny. A recent audit reports significant overlap between their training data and the public benchmarks used to assess their accuracy. The findings suggest that reported benchmark scores may overstate what these models can actually do.

Benchmark Overlap Unveiled

The audit examined open VLMs using four different detectors on datasets such as SLAKE-En, PathVQA, and VQA-RAD. Remarkably, 19.8% of SLAKE-En images were flagged as overlapping with the training data for SigLIP-B-16, and 4.2% for SigLIP-SO400M. Out-of-domain controls, however, showed no such overlaps. This raises a critical question: Are these models memorizing data instead of learning from it?

Manual checks revealed that while these flagged images weren't pixel-for-pixel duplicates, they often came from similar sources or distributions. This points to distributional overlap rather than outright memorization. Still, it challenges the assumption that these models were trained without exposure to benchmark data.
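To make the idea of a "flagged" image concrete, here is a minimal sketch of one generic approach: compare embeddings by cosine similarity and flag any benchmark image whose nearest reference image clears a threshold. This is not the audit's method (the four detectors are not described here). The 2-D vectors, the threshold, and every function name are hypothetical, chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def max_similarity(query, reference_set):
    """Similarity of the query to its nearest neighbor in the reference set."""
    return max(cosine(query, ref) for ref in reference_set)

def flagged_fraction(images, reference_set, threshold):
    """Share of images whose nearest reference neighbor clears the threshold."""
    flags = [max_similarity(img, reference_set) >= threshold for img in images]
    return sum(flags) / len(flags)

# Toy 2-D "embeddings" (hypothetical; real image embeddings have hundreds of dimensions).
reference = [(1.0, 0.0), (0.0, 1.0)]        # stand-in for training-side images
benchmark = [(0.99, 0.05), (0.7, 0.7), (0.02, 1.0), (0.6, 0.8)]
control = [(-1.0, 0.1), (-0.5, -0.5)]       # stand-in for out-of-domain images
THRESHOLD = 0.95                            # assumed value, not taken from the audit

benchmark_rate = flagged_fraction(benchmark, reference, THRESHOLD)
control_rate = flagged_fraction(control, reference, THRESHOLD)
print(f"benchmark flagged: {benchmark_rate:.0%}, control flagged: {control_rate:.0%}")
```

A flag from a detector like this only says two images sit close together in embedding space. That is consistent with the manual checks: near neighbors can come from the same source or distribution without being duplicates.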

Textual Confusion

On the textual side, things get even murkier. Qwen2.5-VL displayed a persistent signal of canonical-order exchangeability on the SLAKE-En dataset. Even when the order was altered, this signal remained, which is unusual. On the OmniMedVQA mirror, multiple medical and general VLMs showed similar signals, except for BLIP-2, which remained clean. This anomaly suggests that some models might be picking up more on data structure than content.
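For readers unfamiliar with exchangeability tests, one common form is a permutation test: score the benchmark in its published order, score it again under many random shuffles, and ask how often a shuffle does at least as well. The sketch below assumes that form; the audit's exact procedure is not spelled out here. Both scoring functions are hypothetical stand-ins for a model's log-likelihood of the benchmark in a given order.

```python
import random

def permutation_p_value(items, score_fn, n_shuffles=199, seed=0):
    """Fraction of orderings (canonical included) scoring at least as high as canonical."""
    rng = random.Random(seed)
    canonical = score_fn(items)
    at_least_as_high = 0
    for _ in range(n_shuffles):
        shuffled = list(items)
        rng.shuffle(shuffled)
        if score_fn(shuffled) >= canonical:
            at_least_as_high += 1
    return (1 + at_least_as_high) / (1 + n_shuffles)

def toy_order_sensitive_score(order):
    """Hypothetical scorer that rewards items appearing in their original sequence."""
    return sum(1 for a, b in zip(order, order[1:]) if b == a + 1)

def toy_order_blind_score(order):
    """Hypothetical scorer that ignores ordering entirely."""
    return float(len(order))

examples = list(range(8))  # item indices in canonical (published) order
p_sensitive = permutation_p_value(examples, toy_order_sensitive_score)
p_blind = permutation_p_value(examples, toy_order_blind_score)
print(f"order-sensitive scorer: p = {p_sensitive:.3f}")
print(f"order-blind scorer:     p = {p_blind:.3f}")
```

A small p-value means the canonical order is scored unusually well compared with shuffles, which is the kind of signal this family of tests looks for. An order-blind scorer gives no such signal.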

Unreliable Detectors?

Here's where it gets interesting. The signals from cohort-relative Min-K%++ tail enrichment and cross-model top-K overlap did not hold up against an external baseline. Surprisingly, BLIP-2, despite lacking medical-VQA exposure, reproduced apparent positive signals. This indicates that these cohort-relative detectors might be unreliable for determining if a model has seen specific data.
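The logic of an external-baseline check can be sketched in a few lines. The scoring function below is plain Min-K% (the average of a text's lowest-probability tokens), a simpler relative of Min-K%++ that leaves out the extra per-token normalization. The cutoff, the score lists, and the "tail fraction" summary are hypothetical illustrations, not the audit's definitions or data.

```python
def min_k_score(token_logprobs, k=0.5):
    """Mean of the lowest k fraction of token log-probabilities (plain Min-K%)."""
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

def tail_fraction(scores, cutoff):
    """Share of items whose score lands at or above the cutoff."""
    return sum(score >= cutoff for score in scores) / len(scores)

# One text's token log-probs: the two lowest are -4.0 and -2.0, so the score is their mean.
example_score = min_k_score([-0.1, -4.0, -0.3, -2.0], k=0.5)

# Hypothetical per-item scores on the same benchmark items.
CUTOFF = -1.0                                   # assumed cohort-derived cutoff
audited_scores = [-1.0, -0.8, -3.0, -0.9]       # model under audit
baseline_scores = [-0.95, -0.7, -2.9, -0.85]    # model with no exposure to the benchmark

audited_tail = tail_fraction(audited_scores, CUTOFF)
baseline_tail = tail_fraction(baseline_scores, CUTOFF)

# If a model that never saw the data shows the same enrichment, the signal
# cannot be attributed to exposure.
signal_survives_baseline = audited_tail > baseline_tail
print(audited_tail, baseline_tail, signal_survives_baseline)
```

In this toy case the unexposed baseline matches the audited model's tail fraction, so the apparent signal tells us nothing about exposure. That mirrors the BLIP-2 result described above.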

Strip away the technical jargon, and you get a fundamental question about the reliability of the tools we use to evaluate VLMs. If these detectors can be fooled or misinterpret data exposure, then their conclusions might be flawed.

Why This Matters

Frankly, the reliability of medical VLMs is at stake. If these models are partly evaluated on data they've seen before, their reported performance could be misleading. For healthcare professionals relying on AI to interpret medical images accurately, this overestimation could have real-world consequences.

So, what's next? The audit suggests that we can't rely solely on current benchmark practices. New, more robust methods are needed to ensure that these models are assessed accurately, without any shadow of prior data exposure.