Medical Vision-Language Models: Are They Really Learning?

Medical vision-language models (VLMs) are under scrutiny for what they have actually learned. These models are often evaluated on public benchmarks whose images and question-answer pairs have been available online for years. Yet reports often assume these benchmarks were absent from pretraining. But is that truly the case?

Unveiling the Overlap

In a recent audit of open VLMs on benchmarks such as SLAKE-En and PathVQA, researchers found measurable overlap with existing datasets. On SLAKE-En, for instance, 19.8% of images were flagged by a detector built on SigLIP-B-16. A SigLIP-SO400M detector, meanwhile, flagged 4.2% of images. Out-of-domain controls, by comparison, produced no flags. This points not to direct pixel-level memorization but to significant source or distributional overlap. Is it possible these benchmarks aren't as fresh to the models as we think?
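To make the idea of an embedding-based flag rate concrete, here is a minimal sketch. It assumes a detector that compares image embeddings by cosine similarity and sets its threshold from out-of-domain controls; the article does not describe the audit's actual procedure, so the calibration rule, the function names, and the toy 3-dimensional vectors below are all illustrative assumptions, not the researchers' method.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def max_similarity(query, reference):
    """Highest similarity between one query and any reference embedding."""
    return max(cosine(query, r) for r in reference)

def flag_rate(queries, reference, threshold):
    """Fraction of queries whose best match strictly exceeds the threshold."""
    flagged = sum(1 for q in queries if max_similarity(q, reference) > threshold)
    return flagged / len(queries)

# Toy stand-ins for encoder outputs (hypothetical values, not real embeddings).
reference = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # e.g. web-scale pretraining images
controls = [[0.0, 0.0, 1.0], [0.1, 0.0, 1.0]]    # out-of-domain images
benchmark = [
    [1.0, 0.05, 0.0],   # close to the first reference image
    [0.0, 0.05, 1.0],   # looks like a control
    [0.0, 1.0, 0.1],    # close to the second reference image
    [0.05, 0.05, 1.0],  # looks like a control
]

# Assumed calibration: the threshold is the highest similarity any control
# reaches, so controls yield zero flags by construction.
threshold = max(max_similarity(c, reference) for c in controls)

control_rate = flag_rate(controls, reference, threshold)
benchmark_rate = flag_rate(benchmark, reference, threshold)
print(f"controls flagged: {control_rate:.1%}, benchmark flagged: {benchmark_rate:.1%}")
```

Under this setup, a nonzero benchmark flag rate alongside a zero control rate says only that benchmark images sit closer to the reference set than unrelated images do. That is consistent with source or distributional overlap and does not by itself show memorization.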

Textual Signals and Their Reliability

On the textual front, Qwen2.5-VL on SLAKE-En showed a canonical-order exchangeability signal that survived ordering ablation tests. Interestingly, on the OmniMedVQA mirror, this signal appeared for both medical and general VLMs, with the exception of BLIP-2, which remained unaffected. Two other detectors, cohort-relative Min-K%++ tail enrichment and cross-model top-K overlap, didn't hold up. Both collapsed under an external pre-domain baseline, in which BLIP-2 reproduced the expected signals despite having no plausible medical-VQA exposure. This raises doubts about the reliability of these detectors as standalone signals.
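One common way to test for an order signal of this kind is a permutation test: if a model has never seen a benchmark, the order of its examples should not matter, so the score of the published order should look like the score of any shuffle. The sketch below assumes that framing. The article does not give the audit's exact statistic, and `order_pvalue`, `memorized_order_score`, and `order_blind_score` are hypothetical names with toy scorers standing in for a real model's log-likelihood.

```python
import random

def order_pvalue(examples, sequence_score, n_shuffles=200, seed=0):
    """Permutation p-value: how often a random ordering scores at least
    as high as the canonical (published) ordering."""
    rng = random.Random(seed)
    canonical = sequence_score(examples)
    at_least = 0
    for _ in range(n_shuffles):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        if sequence_score(shuffled) >= canonical:
            at_least += 1
    # The +1 terms count the canonical ordering itself, keeping p above zero.
    return (1 + at_least) / (1 + n_shuffles)

def memorized_order_score(seq):
    """Toy scorer that rewards adjacent pairs appearing in canonical order,
    mimicking a model that has seen the benchmark file as published."""
    return sum(1 for a, b in zip(seq, seq[1:]) if b == a + 1)

def order_blind_score(seq):
    """Toy scorer that ignores ordering entirely."""
    return sum(seq)

examples = list(range(10))  # example IDs in canonical order

p_memorized = order_pvalue(examples, memorized_order_score)
p_blind = order_pvalue(examples, order_blind_score)
print(f"order-sensitive scorer p = {p_memorized:.3f}, order-blind scorer p = {p_blind:.3f}")
```

A small p-value means the canonical order scores unusually well compared with shuffles. As the BLIP-2 baseline result in the audit illustrates for other detectors, a signal like this is best read against a model with no plausible exposure before treating it as evidence of contamination.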

Implications for Future Evaluations

So, what does this mean for future evaluations of medical VLMs? For one, it questions the validity of current benchmark results. If models are inadvertently pre-exposed to datasets, reported accuracies might not reflect genuine learning but rather a regurgitation of previously seen data. Shouldn't we be more critical of these benchmarks?

The FDA pathway matters more than the press release in these situations. Understanding the true capabilities of these models is essential for their implementation in clinical settings. Surgeons I've spoken with say these tools could revolutionize diagnostics, but only if they're genuinely innovative. The key takeaway? As the field evolves, more rigorous and transparent evaluation methods will be essential to ensure these models deliver real-world benefits.
