Rethinking Depression Detection Benchmarks: The Numbers Don't Lie

Benchmarks for automatic clinical depression detection are under the microscope, with recent findings suggesting that they might not be as reliable as previously thought. The paper behind those findings challenges the status quo, re-examining evaluation methods across several datasets, including DAIC/E-DAIC, CMDC, and ANDROIDS.

Re-Evaluating E-DAIC

The E-DAIC protocol, long considered a standard, was put to the test with strict leave-one-subject-out cross-validation. The result? A macro-F1 score of 0.723, the highest reported so far. The score highlights how misleading it can be to rely on privileged official holdouts. It’s a conservative, yet telling, reference point.
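To make the evaluation scheme concrete, here is a minimal sketch of leave-one-subject-out scoring with macro-F1 computed over the pooled held-out predictions. It is not the paper's pipeline: the nearest-centroid model, the single scalar feature, and every function name are illustrative assumptions, and the toy data is made up.

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def fit_centroids(xs, ys):
    """Toy stand-in model (assumption): per-class mean of one scalar feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(xs, ys):
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_centroid(centroids, x):
    """Assign the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))

def loso_macro_f1(subject_ids, xs, ys):
    """Hold out one subject at a time, train on the rest, pool the predictions."""
    pooled_true, pooled_pred = [], []
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        model = fit_centroids([xs[i] for i in train], [ys[i] for i in train])
        for i in test:
            pooled_true.append(ys[i])
            pooled_pred.append(predict_centroid(model, xs[i]))
    return macro_f1(pooled_true, pooled_pred)

# Made-up toy data: six subjects, one interview each.
subjects = ["a", "b", "c", "d", "e", "f"]
features = [0.1, 0.2, 0.3, 0.8, 0.9, 0.25]
labels = [0, 0, 0, 1, 1, 1]
score = loso_macro_f1(subjects, features, labels)
```

Because every subject is scored by a model that never saw that subject, no held-out split gets special status, which is what makes the figure a conservative reference point.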

But here’s the twist: when 96 different model configurations were tested against the E-DAIC official split, the results were all over the map. The top cross-validation performer ranked only twentieth on the official split. The supposed champion? It took the top spot in just 32.3% of subject bootstraps. That raises the question: are these benchmarks really reflective of true model performance?
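A rank-stability check of this kind can be sketched as follows: resample subjects with replacement, re-score every configuration on each resample, and count how often each one comes out on top. This is an assumption-laden illustration, not the paper's code. Accuracy stands in for the metric (the article reports macro-F1), and the function names, defaults, and tie-breaking rule are hypothetical.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of matching labels; a simple stand-in for macro-F1."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def top_rank_frequency(y_true, preds_by_config, n_boot=1000, seed=0, metric=accuracy):
    """Share of subject bootstraps in which each configuration scores highest.

    y_true: one label per subject.
    preds_by_config: {config_name: one prediction per subject, same order}.
    """
    rng = random.Random(seed)
    n = len(y_true)
    wins = {name: 0 for name in preds_by_config}
    for _ in range(n_boot):
        # Draw n subjects with replacement and reuse the same draw for every config.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_true = [y_true[i] for i in idx]
        scores = {
            name: metric(sample_true, [preds[i] for i in idx])
            for name, preds in preds_by_config.items()
        }
        # Ties go to the first-listed configuration (an arbitrary choice).
        winner = max(scores, key=scores.get)
        wins[winner] += 1
    return {name: count / n_boot for name, count in wins.items()}
```

A configuration that wins only about a third of the resamples, as the reported champion did, is hard to call a clear winner: the ordering depends heavily on which subjects happen to be in the test set.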

CMDC and ANDROIDS: External Validation

Enter the CMDC and ANDROIDS datasets. Baselines on both, heralded for their near-ceiling in-domain results, struggled outside their comfort zones. Zero-shot transfer tests showed marked performance declines, underlining how hard it is to apply models across datasets without tailored adaptation.
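The shape of a zero-shot transfer test can be sketched in a few lines: fit a model on one corpus only, then score it unchanged on every corpus, so in-domain and cross-domain numbers sit side by side. Everything here is a hypothetical illustration. The threshold model, the corpus names `corpus_a` and `corpus_b`, and the numbers are made up and do not come from CMDC or ANDROIDS.

```python
def fit_threshold(xs, ys):
    """Toy stand-in model (assumption): midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_threshold(threshold, xs):
    """Label 1 for feature values above the threshold, else 0."""
    return [1 if x > threshold else 0 for x in xs]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def transfer_matrix(train_sets, test_sets):
    """Score every (source, target) pair with no adaptation on the target."""
    results = {}
    for src, (xs_train, ys_train) in train_sets.items():
        model = fit_threshold(xs_train, ys_train)  # fitted on the source corpus only
        for tgt, (xs_test, ys_test) in test_sets.items():
            results[(src, tgt)] = accuracy(ys_test, predict_threshold(model, xs_test))
    return results

# Made-up data: the same task, but corpus_b's feature scale is shifted by 1.0.
train_sets = {
    "corpus_a": ([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]),
    "corpus_b": ([1.1, 1.2, 1.8, 1.9], [0, 0, 1, 1]),
}
test_sets = {
    "corpus_a": ([0.15, 0.85], [0, 1]),
    "corpus_b": ([1.15, 1.85], [0, 1]),
}
matrix = transfer_matrix(train_sets, test_sets)
```

In this toy setup the diagonal entries (same source and target) are perfect while the off-diagonal entries fall to chance, the same qualitative pattern as near-ceiling in-domain results that decline under transfer.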

Text vs. Audio: The Great Divide

One of the study's most striking insights came from comparing text and audio model responses to interviews. Text models soared when dealing with symptom-dense dialogue, while audio models barely budged. The gap is clear: text has the upper hand in detecting nuances that audio simply misses.

So, what's the takeaway here? The current benchmarks may not be the ultimate measure we once thought. If they can't provide consistent rankings, are they truly effective? It’s time for the field to reevaluate how it measures success, ensuring models aren't just statistically impressive but genuinely reliable across scenarios.
