The Hidden Flaws in Clinical AI: Why Format is the Real Culprit

Clinical triage AI models such as Gemma 3 4B/12B IT and Qwen3-8B are hitting a peculiar roadblock. They handle free-text inputs well, but they flounder on multiple-choice formats. The struggle does not come from a lack of clinical understanding; the output format itself is tripping them up.

Output Format: The Real Issue

Sparse-autoencoder (SAE) features show that the models carry consistent medical features across both free-text and multiple-choice inputs. Yet these features 'go silent' at the multiple-choice decision token, the position where the model commits to an answer letter. So the clinical representation is not what's lacking. Instead, the format of the output is leading the models astray.

Three independent methods concur: natural-language autoencoder verbalization, decision-token logit attribution, and top-feature characterization. All three show that non-medical features tied to the scaffold and format of the input are the ones driving the decision logits. This isn't just a minor quirk; it's a systemic flaw in how these models handle multiple-choice questions.
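To make decision-token logit attribution concrete, here is a minimal sketch, not the study's actual pipeline. It assumes a simplified linear readout in which each SAE feature's direct contribution to an answer token's logit is its activation multiplied by the dot product of its decoder vector with that token's unembedding vector (any final normalization is ignored). The function names, the toy numbers, and the "medical"/"format" labels are all hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def feature_logit_attribution(activations, decoder_rows, unembed_vec):
    """Direct contribution of each SAE feature to one answer token's logit.

    Assumes a linear readout: contribution_i = a_i * (d_i . u),
    where d_i is the feature's decoder vector and u is the token's
    unembedding vector.
    """
    return [a * dot(d, unembed_vec) for a, d in zip(activations, decoder_rows)]


def share_by_label(attributions, labels):
    """Fraction of total absolute attribution carried by each feature label."""
    totals = {}
    for attr, label in zip(attributions, labels):
        totals[label] = totals.get(label, 0.0) + abs(attr)
    grand_total = sum(totals.values())
    return {k: (v / grand_total if grand_total else 0.0) for k, v in totals.items()}


# Toy example (made-up numbers): one medical feature that is silent at the
# decision token, and two format-related features that are active.
activations = [0.0, 2.0, 1.0]
decoder_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
unembed_vec = [0.5, 1.0]          # hypothetical unembedding for one answer letter
labels = ["medical", "format", "format"]

attributions = feature_logit_attribution(activations, decoder_rows, unembed_vec)
shares = share_by_label(attributions, labels)
```

In this toy setup the silent medical feature contributes nothing to the logit, so the whole attribution share lands on the format features, which is the pattern the article describes.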

The Penalty of Multiple-Choice

The real kicker is what the multiple-choice penalty looks like in practice. The models' decisions often get inverted, with a model picking an adjacent acuity letter instead of the correct one. This isn't about a knowledge gap; it's more about being tricked by the format.
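One way to check whether errors cluster on neighbouring acuity levels is to split wrong answers into adjacent and distant misses. The sketch below is illustrative only: the four-letter scale "ABCD" and the function name are assumptions, since the article does not say how many acuity options the benchmark uses.

```python
def error_breakdown(predicted, gold, scale="ABCD"):
    """Count correct answers, adjacent-letter misses, and distant misses.

    `scale` lists the answer letters in acuity order (hypothetical here).
    """
    counts = {"correct": 0, "adjacent": 0, "distant": 0}
    for p, g in zip(predicted, gold):
        gap = abs(scale.index(p) - scale.index(g))
        if gap == 0:
            counts["correct"] += 1
        elif gap == 1:
            counts["adjacent"] += 1   # off by exactly one acuity level
        else:
            counts["distant"] += 1    # off by two or more levels
    return counts
```

A high share of adjacent misses relative to distant ones would fit the picture of a model that knows roughly where the case sits on the acuity scale but slips on the letter.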

Interestingly, even when the structured and natural-language inputs are shuffled to eliminate positional biases, the issue persists. Does this mean we're misjudging AI capabilities based on an outdated testing format? If so, what does this imply for the future of AI in clinical settings?
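The article does not describe the exact shuffling procedure, so the following is only a sketch of one common positional-bias control: reorder the answer options with a seeded shuffle and track where the correct answer ends up. The function name and the lettering scheme are hypothetical.

```python
import random

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def shuffle_options(options, correct_index, seed):
    """Return the options in a seeded random order plus the new correct letter."""
    rng = random.Random(seed)          # seeded so each shuffle is reproducible
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    new_correct = order.index(correct_index)   # where the gold option landed
    return shuffled, LETTERS[new_correct]
```

If accuracy stays depressed across many seeds, the multiple-choice penalty cannot be blamed on a preference for a particular answer position.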

The Path Forward

As AI continues to integrate into critical sectors like healthcare, it's time we address these format-driven errors. Clinical knowledge alone is not enough if the answer format can override it.

It's time to reconsider how we test and trust our AI systems, ensuring they are not only clinically competent but also format-savvy.
