Unmasking Calibration Flaws in Multimodal Cancer Survival Models
Despite impressive performance metrics, multimodal cancer survival models are missing the mark on calibration. A deep dive reveals troubling inconsistencies.
Multimodal deep learning models are making waves in cancer prognosis, boasting impressive discriminative capabilities. By fusing whole-slide histopathology images (WSIs) with genomic data, these models have drawn attention for their strong concordance-index scores. But there's a snag in the system: their survival probability predictions often lack proper calibration.
Calibration Crisis
Let's apply some rigor here. A recent systematic audit, the first of its kind, examined the calibration of multimodal WSI-genomics survival architectures. The audit evaluated both native discrete-time survival outputs (Experiment A) and Breslow-reconstructed survival curves (Experiment B), and the findings were illuminating, if not alarming. Across 290 fold-level tests, an astounding 166 failed a statistical test of correct calibration. What's being overlooked is that a high concordance index doesn't guarantee clinical reliability.
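The 1-calibration tests at issue check whether predicted survival probabilities at a fixed time horizon actually match observed event rates. As a rough illustration only (not the audit's exact procedure), a Hosmer-Lemeshow-style check bins patients by predicted probability and compares expected to observed event counts; the function name is ours, and the sketch ignores censoring, which a real survival-calibration test must handle:

```python
import numpy as np
from scipy.stats import chi2

def one_calibration_pvalue(pred_surv, event_observed, n_bins=10):
    """Hosmer-Lemeshow-style 1-calibration test at a fixed horizon.

    pred_surv: predicted survival probabilities at the horizon (0..1)
    event_observed: 1 if the event occurred by the horizon, else 0
    Illustrative only: ignores censoring.
    """
    pred_surv = np.asarray(pred_surv, dtype=float)
    event_observed = np.asarray(event_observed, dtype=float)
    order = np.argsort(pred_surv)
    stat = 0.0
    for idx in np.array_split(order, n_bins):
        expected = np.sum(1.0 - pred_surv[idx])              # expected events in bin
        observed = np.sum(event_observed[idx])               # observed events in bin
        variance = np.sum(pred_surv[idx] * (1.0 - pred_surv[idx]))
        if variance > 0:
            stat += (observed - expected) ** 2 / variance
    # chi-square with n_bins - 2 degrees of freedom (Hosmer-Lemeshow convention)
    return chi2.sf(stat, df=n_bins - 2)
```

A small p-value from such a test is what "failing calibration" means in the audit: the gap between predicted and observed event rates is too large to attribute to chance.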
Experiment A scrutinized three models using TCGA-BRCA data. Disturbingly, all three failed 1-calibration in the majority of folds. Specifically, 12 out of 15 fold-level tests were rejected after the Benjamini-Hochberg correction was applied. Similarly, in Experiment B, the celebrated MCAT model, despite achieving a C-index of 0.817 on GBMLGG, faltered on calibration across all five folds tested.
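The Benjamini-Hochberg correction applied to those fold-level tests controls the false discovery rate when many hypotheses are tested at once. Here is a minimal sketch of the standard BH step-up procedure (the function name is ours, and the study's exact testing pipeline may differ):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses under BH FDR control."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Reject the k smallest p-values, where k is the largest index
    # satisfying p_(k) <= (k / m) * alpha.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = ranked <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject
```

Because BH is less conservative than a Bonferroni correction, a rejection after BH (as in 12 of the 15 fold-level tests above) is a fairly robust signal of miscalibration rather than a multiple-testing artifact.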
Fusion and Fixes
Gating-based fusion methods show promise in improving calibration, while bilinear and concatenation fusion aren't up to snuff. The underappreciated upside: post-hoc techniques like Platt scaling can mitigate these calibration issues without compromising discriminative power. For instance, Platt scaling cut MCAT's miscalibration from five failing folds out of five down to two.
These findings raise the question: why is the field so fixated on the concordance index as the sole measure of effectiveness? Color me skeptical, but it's clear that using these models in clinical settings without addressing calibration issues could lead to misleading prognoses.
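For context, the concordance index the field leans on measures only ranking: the fraction of comparable patient pairs whose predicted risks are ordered consistently with their survival times. A minimal O(n²) sketch of Harrell's C-index (ignoring tied event times, which production implementations handle) makes clear why it says nothing about whether the probabilities themselves are right:

```python
import numpy as np

def concordance_index(times, events, risk_scores):
    """Harrell's C-index, minimal sketch.
    A pair (i, j) is comparable when the earlier time is an observed event;
    it is concordant when the patient who failed earlier has the higher risk."""
    t = np.asarray(times, dtype=float)
    e = np.asarray(events, dtype=bool)
    r = np.asarray(risk_scores, dtype=float)
    concordant = 0.0
    comparable = 0
    n = len(t)
    for i in range(n):
        if not e[i]:               # censored subjects can't anchor a pair
            continue
        for j in range(n):
            if t[i] < t[j]:        # i fails first while j is still at risk
                comparable += 1
                if r[i] > r[j]:
                    concordant += 1.0
                elif r[i] == r[j]:
                    concordant += 0.5   # tied risks count half
    return concordant / comparable
```

Any strictly monotone distortion of the scores leaves this value unchanged, so a model can achieve a C-index of 0.817 while its survival probabilities are badly miscalibrated.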
The Road Ahead
To be fair, the field of multimodal deep learning for cancer prognosis is still in its relative infancy. However, these calibration discrepancies highlight a key oversight in the rush to deploy these models. Are we willing to risk patient outcomes on models that don't hold up under scrutiny? As the push for AI-driven healthcare continues, ensuring proper calibration should be a non-negotiable part of model evaluation.
Key Terms Explained
Attention: A mechanism that lets neural networks focus on the most relevant parts of their input when producing output.
Deep learning: A subset of machine learning that uses neural networks with many layers (hence 'deep') to learn complex patterns from large amounts of data.
Evaluation: The process of measuring how well an AI model performs on its intended task.
Multimodal AI: AI models that can understand and generate multiple types of data — text, images, audio, video.