New AI Model Boosts Medical Code Retrieval Across Languages

JUST IN: A breakthrough in sentence-embedding models might change how we handle clinical data across languages. Traditionally, these models are tailored for English, but a new approach is shaking things up by focusing on other languages like Spanish and Portuguese.

The Need for Multilingual Models

When it comes to retrieving clinical data in non-English languages, current models fall short. They lack the precision needed for accurate medical coding, especially with ICD-10-CM codes. But there's a new solution on the horizon, one that could bridge this gap by using large generative language models as data factories.

Researchers have developed a two-stage retriever that pairs a bi-encoder with a cross-encoder reranker. The system starts from a Spanish biomedical encoder, PlanTL-GOB-ES/bsc-bio-ehr-es, and is fine-tuned on synthetic data generated by Gemini. The result? A model that covers English, Spanish, Catalan, Italian, Portuguese, and French.
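For readers who want to see the shape of a two-stage retriever, here is a minimal sketch of the retrieve-then-rerank flow. It is not the researchers' code. The functions `embed`, `cosine`, and `cross_score` are hypothetical toy stand-ins (bag-of-words vectors and token overlap) for the fine-tuned bi-encoder and the cross-encoder, and the four-entry corpus is made up for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a bi-encoder: a bag-of-words count vector (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cross_score(query, doc):
    """Toy stand-in for a cross-encoder: scores the (query, doc) pair jointly
    using token-set overlap (assumption, not a real neural reranker)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def retrieve_then_rerank(query, corpus, top_k=3, final_k=2):
    """Stage 1 shortlists candidates cheaply; stage 2 reorders only that shortlist."""
    # Stage 1: documents are embedded independently of the query, so in a real
    # system these vectors could be precomputed once and stored in an index.
    doc_vecs = {code: embed(desc) for code, desc in corpus.items()}
    q_vec = embed(query)
    candidates = sorted(
        corpus, key=lambda c: cosine(q_vec, doc_vecs[c]), reverse=True
    )[:top_k]
    # Stage 2: the more expensive pairwise scorer runs on the shortlist only.
    reranked = sorted(
        candidates, key=lambda c: cross_score(query, corpus[c]), reverse=True
    )
    return reranked[:final_k]


# Made-up mini corpus of code descriptions, for illustration only.
corpus = {
    "E11.9": "type 2 diabetes mellitus without complications",
    "E10.9": "type 1 diabetes mellitus without complications",
    "I10": "essential primary hypertension",
    "J18.9": "pneumonia unspecified organism",
}

print(retrieve_then_rerank("diabetes mellitus type 2", corpus))
```

The split matters for cost: the first stage compares one query vector against vectors that can be computed ahead of time, while the second stage has to read each query and candidate together, so it is applied only to a handful of candidates. A word-overlap toy like this one cannot match a Spanish or Portuguese query to an English description, which is exactly the gap a multilingual encoder is meant to close.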

Performance That Speaks Volumes

The numbers are in, and they're impressive. The bi-encoder stands toe-to-toe with BioBERT-ST, landing within a hundredth of it on Mean Reciprocal Rank (MRR): 0.876 versus 0.866. It even surpasses BioBERT-ST on Recall at Rank 3 and 5, without any English biomedical pretraining. And with the cross-encoder reranker added, aggregate Recall at Rank 5 jumps to 0.822.
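For anyone unfamiliar with the two metrics, the sketch below shows how MRR and Recall at Rank k are commonly computed when each query has exactly one correct code. That single-gold-code setup is an assumption here, as are the function names and the toy rankings, none of which come from the study.

```python
def reciprocal_rank(ranked_codes, gold_code):
    """Return 1/rank of the gold code, or 0.0 if it was not retrieved at all."""
    for rank, code in enumerate(ranked_codes, start=1):
        if code == gold_code:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, gold_codes):
    """Average the reciprocal rank over all queries."""
    scores = [reciprocal_rank(r, g) for r, g in zip(rankings, gold_codes)]
    return sum(scores) / len(scores)


def recall_at_k(rankings, gold_codes, k):
    """Fraction of queries whose gold code appears in the top k results."""
    hits = sum(1 for r, g in zip(rankings, gold_codes) if g in r[:k])
    return hits / len(gold_codes)


# Toy rankings for three queries (illustrative data only).
rankings = [
    ["E11.9", "E10.9", "I10"],       # gold code at rank 1
    ["J45.909", "J18.9", "J44.9"],   # gold code at rank 2
    ["I10", "I25.10", "E78.5"],      # gold code never retrieved
]
gold_codes = ["E11.9", "J18.9", "N39.0"]

print(mean_reciprocal_rank(rankings, gold_codes))  # (1 + 0.5 + 0) / 3 = 0.5
print(recall_at_k(rankings, gold_codes, k=1))
print(recall_at_k(rankings, gold_codes, k=3))
```

MRR rewards putting the right code near the very top, while Recall at Rank k only asks whether it shows up anywhere in the first k results. That is why a model can trail slightly on one and still lead on the other.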

Here's the kicker: in Portuguese, the model achieves a Recall at Rank 5 of 0.829, a massive leap from BioBERT-ST's 0.714. Considering Portuguese's growing influence in global clinical research, this is a breakthrough.

Why This Matters

And just like that, the leaderboard shifts. This isn't just about numbers. It's about making medical data more accessible and accurate, no matter the language. For a field as critical as healthcare, these improvements aren't just welcome, they're essential.

Is a small regression in English performance worth these gains? Absolutely. The trade-off is clinically acceptable and opens doors for non-English-speaking countries to make better use of their medical data. Moreover, the work provides an open recipe for building domain-specific retrievers with LLM-generated data.

The labs are scrambling to keep up, and for good reason. The work not only quantifies the gains but also pinpoints where they concentrate, by language and by rank. It sets a new benchmark for multilingual clinical data retrieval, and it's about time.