The Inconsistencies of MedGemma: A Cautionary Tale for Medical LLMs
MedGemma's performance in medical contexts reveals its vulnerability to prompt changes, suggesting that established prompt engineering methods may not translate well to specialized domains.
As Large Language Models (LLMs) increasingly find their way into the intricate world of medical applications, a recent evaluation of MedGemma, a prominent model in this field, reveals concerning fragility. MedGemma, available in 4-billion- and 27-billion-parameter versions, was put to the test on MedMCQA and PubMedQA, datasets designed to probe its robustness. Unfortunately, the results are underwhelming.
The Pitfalls of Prompting
The allure of Chain-of-Thought (CoT) prompts lies in their promise of improved reasoning. However, MedGemma's performance paints a different picture. When CoT prompting was employed, accuracy decreased by 5.7% compared to direct answering. This isn't just a fluke; it's a pattern. Even the introduction of few-shot examples, often heralded as a means to bolster understanding, led to an 11.9% drop in performance while simultaneously increasing position bias from 0.14 to a staggering 0.47.
What they're not telling you: these models, despite their impressive parameter counts, are far from infallible. The claim doesn't survive scrutiny when answer shuffling causes prediction changes 59.1% of the time, accompanied by a precipitous drop in accuracy of up to 27.4 percentage points.
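The shuffling probe behind that 59.1% figure is easy to reproduce in spirit. The evaluation's exact protocol isn't given here, so this is a minimal sketch under one assumption: `predict` stands in for any model call that takes an ordered list of answer options and returns the chosen option's text. A robust model should give the same answer no matter how the options are ordered.

```python
import random

def shuffle_flip_rate(questions, predict, seed=0):
    """Fraction of questions whose prediction changes after one reshuffle
    of the answer options -- a cheap probe for position sensitivity."""
    rng = random.Random(seed)
    flips = 0
    for opts in questions:
        base = predict(list(opts))
        shuffled = list(opts)
        rng.shuffle(shuffled)
        if predict(shuffled) != base:
            flips += 1
    return flips / len(questions)

# A toy order-invariant "model" never flips, so its rate is exactly 0.0:
robust = lambda opts: min(opts)
print(shuffle_flip_rate([("a", "b", "c")] * 5, robust))  # → 0.0
```

A flip rate near 0.59, as reported for MedGemma, means the model's answer depends on option ordering more often than not.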
The Truncation Trap
One might think that selectively presenting information to a model could enhance its focus. Yet MedGemma's reaction to truncation is telling. Front-truncating context to just 50% of its original length causes the model's accuracy to nosedive below even a no-context baseline. In stark contrast, back-truncation retains 97% of the full-context accuracy. This dichotomy suggests that not all context is created equal: the opening of a passage carries far more of the answer-relevant signal than its tail, and cutting it off is worse than providing no context at all.
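To make the two truncation modes concrete, here is a small sketch. It splits on whitespace rather than using a real tokenizer (an assumption for brevity); "front" drops the beginning of the context and keeps the tail, while "back" keeps the beginning, matching the usage above.

```python
def truncate(context: str, keep_ratio: float, mode: str) -> str:
    """Keep `keep_ratio` of the tokens.
    'front' drops the beginning (keeps the tail);
    'back' drops the end (keeps the head)."""
    tokens = context.split()
    k = int(len(tokens) * keep_ratio)
    if mode == "front":
        return " ".join(tokens[len(tokens) - k:])
    return " ".join(tokens[:k])

passage = "Aspirin irreversibly inhibits COX-1 reducing thromboxane synthesis"
print(truncate(passage, 0.5, "front"))  # tail half survives
print(truncate(passage, 0.5, "back"))   # head half survives
```

The reported 97% vs below-baseline gap implies that, for these datasets, the head half is doing nearly all the work.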
Beyond Prompt Engineering
Interestingly, the study found that cloze scoring (choosing the option token with the highest log-probability) surpassed all other prompting techniques. MedGemma's smaller 4B version achieved 51.8% accuracy, while its larger counterpart hit 64.5%. This indicates that these models hold latent knowledge that often goes unexpressed through generated text alone.
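Cloze scoring sidesteps generation entirely: the model is asked only to score each candidate answer token, and the highest-scoring one wins. A minimal sketch, assuming `option_logprobs` is a mapping from option label to the log-probability a model API assigns that token given the question context (the scores below are made up for illustration):

```python
def cloze_select(option_logprobs: dict[str, float]) -> str:
    """Pick the option whose token the model assigns the highest
    log-probability -- no text generation involved."""
    return max(option_logprobs, key=option_logprobs.get)

# Hypothetical per-option scores for one MedMCQA-style question:
scores = {"A": -2.1, "B": -0.4, "C": -3.0, "D": -1.7}
print(cloze_select(scores))  # → B
```

Because there is no sampling and no free-form output to parse, cloze scoring is immune to formatting quirks in the generated text, which may explain part of its edge here.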
Color me skeptical, but relying solely on established prompt engineering techniques validated on general-purpose models seems misguided when dealing with domain-specific applications like medical LLMs. If permutation voting can recover an additional 4 percentage points over single-ordering inference, it's clear that alternative strategies can yield significant improvements.
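Permutation voting works by running the model once per ordering of the options and taking a majority vote over the underlying answers, so that any positional preference cancels out. The study's exact voting scheme isn't detailed here; this is a sketch under the assumption that `predict` maps an ordered option list to the chosen option's text, and `biased_predict` is a toy stand-in for a position-biased model:

```python
import itertools
from collections import Counter

def permutation_vote(options, predict):
    """Query the model on every ordering of the options and majority-vote
    on the underlying answer text, averaging away position bias."""
    votes = Counter(predict(list(perm))
                    for perm in itertools.permutations(options))
    return votes.most_common(1)[0][0]

def biased_predict(ordered):
    # Toy biased "model": picks the right answer "b" only when it
    # appears in the first two slots, otherwise defaults to slot one.
    return "b" if "b" in ordered[:2] else ordered[0]

print(permutation_vote(["a", "b", "c"], biased_predict))  # → b
```

With n options this costs n! forward passes, so in practice one would vote over a sampled subset of orderings; even so, the reported 4-point recovery suggests the extra inference budget buys real accuracy.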
So, here's the pressing question: Are we ready to trust these models with critical medical decisions when their performance can be so easily swayed by something as trivial as prompt formatting? Perhaps it's time to acknowledge the limitations of these tools and prioritize the development of more reliable methodologies.