The Calibration Dilemma in Multimodal AI for Cancer Prognosis
Multimodal AI models combining histopathology and genomics show promise in cancer prognosis. Yet their calibration remains questionable, limiting their clinical utility.
Multimodal AI models that fuse histopathology images with genomic data are making waves in cancer survival prediction. But here's the catch: while they boast impressive discriminative power on metrics like the concordance index, their ability to produce calibrated survival probabilities is under scrutiny.
Unpacking the Calibration Challenge
In a recent audit of multimodal architectures across multiple cancer types, a troubling pattern emerged: a majority of the models failed to maintain 1-calibration. In one experiment with three models on TCGA-BRCA data, 12 out of 15 fold-level tests failed the calibration check. That's worrisome when you consider the clinical stakes involved.
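To make the fold-level check concrete, here is a minimal sketch of a 1-calibration test at a fixed time horizon. The audit's exact statistic isn't reproduced here; this version uses a common Hosmer-Lemeshow-style recipe, binning patients by predicted survival probability and comparing each bin's Kaplan-Meier estimate to its mean prediction (the function name and binning choices are illustrative assumptions).

```python
import numpy as np
from lifelines import KaplanMeierFitter
from scipy.stats import chi2

def one_calibration_test(surv_probs, times, events, horizon, n_bins=10):
    """Hosmer-Lemeshow-style 1-calibration check at a fixed horizon.

    surv_probs: predicted P(survival > horizon) per patient
    times, events: follow-up times and event indicators (1 = event)
    A small p-value rejects the null hypothesis of proper calibration.
    """
    order = np.argsort(surv_probs)
    stat = 0.0
    for idx in np.array_split(order, n_bins):
        # Expected survival in this bin: the mean predicted probability
        expected = np.clip(surv_probs[idx].mean(), 1e-6, 1 - 1e-6)
        # Observed survival: Kaplan-Meier estimate, which handles censoring
        km = KaplanMeierFitter().fit(times[idx], events[idx])
        observed = km.predict(horizon)
        stat += len(idx) * (observed - expected) ** 2 / (expected * (1 - expected))
    # n_bins - 2 degrees of freedom, per the Hosmer-Lemeshow convention
    return stat, chi2.sf(stat, df=n_bins - 2)
```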
Let's not forget the broader picture. The audit spanned 290 fold-level tests, and in a staggering 166 of them the null hypothesis of proper calibration at the median event time was rejected. When a model like MCAT achieves a concordance index of 0.817 on GBMLGG but fails calibration across all five folds, it signals a fundamental issue. Is a high C-index enough if the probabilities aren't trustworthy?
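Part of the answer lies in what the C-index actually measures: it depends only on the ranking of risk scores, so it is blind to how far off the probabilities themselves are. The toy example below (using lifelines; the numbers are made up) shows discrimination staying identical while the predicted probabilities become absurdly overconfident.

```python
import numpy as np
from lifelines.utils import concordance_index

times = np.array([5.0, 8.0, 12.0, 20.0])   # follow-up times
events = np.array([1, 1, 0, 1])            # 1 = event observed
risk = np.array([0.9, 0.7, 0.4, 0.2])      # predicted event probabilities

# concordance_index expects higher scores for longer survival
c1 = concordance_index(times, -risk, events)
# A monotone rescaling that squeezes every probability above 0.98
# leaves the ranking, and therefore the C-index, untouched
c2 = concordance_index(times, -(0.98 + 0.02 * risk), events)
assert c1 == c2  # identical discrimination, very different probabilities
```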
Fusion Methods and Calibration
Different fusion strategies were evaluated for their calibration behavior. Gating-based fusion, for example, was better calibrated than bilinear and concatenation-based fusion. Yet it's not a silver bullet. Post-hoc techniques like Platt scaling show promise, reducing miscalibration without hurting discrimination. MCAT, for instance, went from failing all five folds to passing three of them after recalibration.
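As a sketch of how such a post-hoc fix works (the audit's exact recalibration setup isn't detailed here, and censoring is ignored for brevity): Platt scaling fits a logistic regression on held-out risk scores, mapping them to event probabilities at a chosen horizon. Because the sigmoid is monotone, the ranking of patients, and hence the C-index, is preserved.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_recalibrate(val_scores, val_event_by_horizon, test_scores):
    """Platt scaling: fit a sigmoid on validation risk scores.

    val_event_by_horizon: binary labels, 1 if the event occurred before
    the chosen horizon (patients censored before the horizon should be
    excluded or reweighted upstream; omitted here for brevity).
    """
    platt = LogisticRegression()
    platt.fit(np.asarray(val_scores).reshape(-1, 1), val_event_by_horizon)
    # Calibrated probability of an event by the horizon; monotone in the
    # raw score, so discrimination (C-index) is unchanged
    return platt.predict_proba(np.asarray(test_scores).reshape(-1, 1))[:, 1]
```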
However, relying on post-hoc fixes isn't a long-term solution. It's akin to patching a leaky boat instead of building a seaworthy vessel. The focus should be on developing inherently calibrated models from the ground up.
The Road Ahead
Why should we care? Well, these models hold potential for revolutionizing cancer treatment, offering personalized survival probabilities that inform patient care. But if those probabilities aren't reliable, the clinical utility is compromised. It's a classic AI conundrum: impressive on paper, yet shaky in practice.
Slapping a model on a GPU rental isn't a convergence thesis. True convergence, where AI models deliver both discriminative power and calibrated outputs, is still a work in progress. Meanwhile, researchers and clinicians need to be cautious, ensuring these tools don't mislead with overconfident predictions.
The intersection is real. Ninety percent of the projects aren't. But those that are need to pass the calibration test if they're to make a real-world impact. It's time to ask: are we prioritizing the right metrics in AI development for healthcare?