Unmasking Calibration Flaws in Multimodal Cancer Survival Models
Despite impressive performance metrics, multimodal cancer survival models are missing the mark on calibration. A deep dive reveals troubling inconsistencies.
Multimodal deep learning models are making waves in cancer prognosis, boasting impressive discriminative capabilities. By fusing whole-slide histopathology images (WSIs) with genomic data, these models have drawn attention for their strong concordance-index scores. But there's a snag in the system: their survival probability predictions often lack proper calibration.
Calibration Crisis
Let's apply some rigor here. A recent systematic audit, the first of its kind, examined the calibration of multimodal WSI-genomics survival architectures. The audit evaluated both native discrete-time survival outputs (Experiment A) and Breslow-reconstructed survival curves (Experiment B), and the findings were illuminating, if not alarming. Across 290 fold-level tests, an astounding 166 failed a statistical test of correct calibration. What's being overlooked is that a high concordance index doesn't guarantee clinical reliability.
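The 1-calibration tests at issue check whether predicted survival probabilities at a fixed time horizon actually match observed event rates. As a rough illustration only (not the audit's exact procedure), a Hosmer-Lemeshow-style check bins patients by predicted probability and compares expected to observed event counts; the function name is ours, and the sketch ignores censoring, which a real survival-calibration test must handle:

```python
import numpy as np
from scipy.stats import chi2

def one_calibration_pvalue(pred_surv, event_observed, n_bins=10):
    """Hosmer-Lemeshow-style 1-calibration test at a fixed horizon.

    pred_surv: predicted survival probabilities at the horizon (0..1)
    event_observed: 1 if the event occurred by the horizon, else 0
    Illustrative only: ignores censoring.
    """
    pred_surv = np.asarray(pred_surv, dtype=float)
    event_observed = np.asarray(event_observed, dtype=float)
    order = np.argsort(pred_surv)
    stat = 0.0
    for idx in np.array_split(order, n_bins):
        expected = np.sum(1.0 - pred_surv[idx])              # expected events in bin
        observed = np.sum(event_observed[idx])               # observed events in bin
        variance = np.sum(pred_surv[idx] * (1.0 - pred_surv[idx]))
        if variance > 0:
            stat += (observed - expected) ** 2 / variance
    # chi-square with n_bins - 2 degrees of freedom (Hosmer-Lemeshow convention)
    return chi2.sf(stat, df=n_bins - 2)
```

A small p-value from such a test is what "failing calibration" means in the audit: the gap between predicted and observed event rates is too large to attribute to chance.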
Experiment A scrutinized three models using TCGA-BRCA data. Disturbingly, all three failed 1-calibration in the majority of folds. Specifically, 12 out of 15 fold-level tests were rejected after the Benjamini-Hochberg correction was applied. Similarly, in Experiment B, the celebrated MCAT model, despite achieving a C-index of 0.817 on GBMLGG, faltered on calibration across all five folds tested.
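The Benjamini-Hochberg correction applied to those fold-level tests controls the false discovery rate when many hypotheses are tested at once. Here is a minimal sketch of the standard BH step-up procedure (the function name is ours, and the study's exact testing pipeline may differ):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of rejected hypotheses under BH FDR control."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    # Reject the k smallest p-values, where k is the largest index
    # satisfying p_(k) <= (k / m) * alpha.
    thresholds = alpha * np.arange(1, m + 1) / m
    below = ranked <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject
```

Because BH is less conservative than a Bonferroni correction, a rejection after BH (as in 12 of the 15 fold-level tests above) is a fairly robust signal of miscalibration rather than a multiple-testing artifact.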
Fusion and Fixes
Gating-based fusion methods show promise in improving calibration, while bilinear and concatenation fusion aren't up to snuff. The underappreciated upside: post-hoc techniques like Platt scaling can mitigate these calibration issues without compromising discriminative power. For instance, Platt scaling cut MCAT's miscalibration from five failing folds out of five down to two.
These findings raise the question: why is the field so fixated on the concordance index as the sole measure of effectiveness? Color me skeptical, but it's clear that using these models in clinical settings without addressing calibration issues could lead to misleading prognoses.
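For context, the concordance index the field leans on measures only ranking: the fraction of comparable patient pairs whose predicted risks are ordered consistently with their survival times. A minimal O(n²) sketch of Harrell's C-index (ignoring tied event times, which production implementations handle) makes clear why it says nothing about whether the probabilities themselves are right:

```python
import numpy as np

def concordance_index(times, events, risk_scores):
    """Harrell's C-index, minimal sketch.
    A pair (i, j) is comparable when the earlier time is an observed event;
    it is concordant when the patient who failed earlier has the higher risk."""
    t = np.asarray(times, dtype=float)
    e = np.asarray(events, dtype=bool)
    r = np.asarray(risk_scores, dtype=float)
    concordant = 0.0
    comparable = 0
    n = len(t)
    for i in range(n):
        if not e[i]:               # censored subjects can't anchor a pair
            continue
        for j in range(n):
            if t[i] < t[j]:        # i fails first while j is still at risk
                comparable += 1
                if r[i] > r[j]:
                    concordant += 1.0
                elif r[i] == r[j]:
                    concordant += 0.5   # tied risks count half
    return concordant / comparable
```

Any strictly monotone distortion of the scores leaves this value unchanged, so a model can achieve a C-index of 0.817 while its survival probabilities are badly miscalibrated.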
The Road Ahead
To be fair, the field of multimodal deep learning for cancer prognosis is still in its relative infancy. However, these calibration discrepancies highlight a key oversight in the rush to deploy these models. Are we willing to risk patient outcomes on models that don't hold up under scrutiny? As the push for AI-driven healthcare continues, ensuring proper calibration should be a non-negotiable part of model evaluation.
Key Terms Explained
Attention: A mechanism that lets neural networks focus on the most relevant parts of their input when producing output.
Deep learning: A subset of machine learning that uses neural networks with many layers (hence 'deep') to learn complex patterns from large amounts of data.
Evaluation: The process of measuring how well an AI model performs on its intended task.
Multimodal AI: AI models that can understand and generate multiple types of data — text, images, audio, video.