Biomedical Language Models’ Embedding Problem: A Fix in Sight
Biomedical language models confuse unrelated terms. New methods show promise in improving accuracy. Key advancements and performance metrics revealed.
Biomedical language models, often used to interpret complex relationships in medical texts, have a serious blind spot. Ask one of these models whether 'cortisol 28 ug/dL' and 'stock-market volatility' are related, and you'll get an alarming cosine similarity of 0.83, on a scale where 1.0 means the two embeddings point in exactly the same direction. This isn't a one-off error, but a systemic issue.
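For readers who want the metric spelled out, here is a minimal sketch of cosine similarity in plain Python. The vectors are toy values invented for illustration, not real model embeddings, and the function name is our own.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (hypothetical, for illustration only).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # unrelated directions
```

Parallel vectors score 1.0 (up to floating-point rounding) and orthogonal vectors score 0.0, which is why a 0.83 between a lab value and a finance term is so far from where it should sit.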
Tests with encoders such as BioBERT and PubMedBERT show unrelated cross-domain pairs scoring between 0.76 and 0.92, when they should register near zero. On cross-domain discrimination, these models reach an accuracy of 0%. So, what's the impact?
The Downstream Effect
Retrieval systems manage to survive this discrepancy because downstream language models filter out the noise. However, Large Behavioural Models (LBMs), which reason over a graph of a user's life, are more susceptible. They could incorrectly infer causal relationships due to false proximity in embeddings, cascading errors through the system. Here, embedding geometry isn't just a parameter to tweak. It's essential for accuracy.
Innovative Solutions
The research reports a promising fix. A contrastive training pass over 72,034 pairs improved the metrics substantially. For instance, PubMedBERT's BIOSSES correlation rose from 0.633 to 0.828. More impressively, a method called BODHI, which mines hard negatives from a biomedical knowledge graph, raised within-vs-across-domain separation to 2.30x and widened the discrimination gap by 0.392, at the cost of a modest 4.5% reduction in BIOSSES score.
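To see why hard negatives matter in contrastive training, consider the sketch below. The article does not say which loss the researchers used; InfoNCE is a common choice and is assumed here. The function name, the temperature of 0.05, and all similarity values are hypothetical.

```python
import math

def info_nce_loss(sim_positive, sim_negatives, temperature=0.05):
    """InfoNCE loss for one anchor, given precomputed cosine similarities.

    Assumption: this is a generic contrastive loss, not necessarily the
    one used in the research described above.
    """
    logits = [sim_positive / temperature]
    logits += [s / temperature for s in sim_negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Negative log-probability assigned to the positive pair.
    return log_sum_exp - logits[0]

# Hypothetical similarities: the positive pair scores 0.80 in both cases.
easy = info_nce_loss(0.80, [0.10, 0.05])  # negatives already far from the anchor
hard = info_nce_loss(0.80, [0.78, 0.05])  # one negative nearly as close as the positive
```

With easy negatives the loss is close to zero, so training has little left to correct. A negative that sits almost as close as the positive produces a much larger loss, and therefore a stronger push to separate the two. That is the general rationale for mining hard negatives, whatever the exact recipe behind BODHI.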
Performance Breakthroughs
Performance isn't just about accuracy. It's also about speed. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to roughly 10 ms, a reported 133x improvement, and processes 555 sentences per second. Counter to conventional wisdom, FP16 outperforms INT8 across all batch sizes on this silicon. On an Ice Lake instance without AMX, however, performance nosedives by 13-27x.
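A quick arithmetic check of the latency figures, as a sketch. The inputs are the rounded numbers quoted above. Reading the 10 ms as a rounded value, and the 555 sentences per second as a batched measurement, are our assumptions rather than anything the source states.

```python
baseline_ms = 1367.0      # single-query latency before OpenVINO, as quoted
optimized_ms = 10.0       # single-query latency after, as rounded in the article
reported_speedup = 133.0  # speedup as quoted

# Speedup implied by the two rounded latencies: about 136.7x.
speedup_from_rounded = baseline_ms / optimized_ms

# Optimized latency that would give exactly 133x: about 10.28 ms.
implied_optimized_ms = baseline_ms / reported_speedup

# Throughput if queries ran strictly one at a time at 10 ms each.
single_query_per_second = 1000.0 / optimized_ms
```

The rounded latencies give about 136.7x rather than 133x, which would be consistent with an unrounded optimized latency near 10.3 ms. One query at a time at 10 ms would yield about 100 sentences per second, so the 555 figure presumably comes from batched inference.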
Code and data are available. The researchers have released the benchmark suite, training corpora, the BODHI generator, and OpenVINO scripts, offering a comprehensive toolkit for further exploration and validation.
Why should you care? This research not only highlights a critical flaw but also proposes a viable solution. In the accelerating world of biomedical AI, ensuring accurate interpretations can have tangible impacts on patient outcomes and healthcare innovations.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Embedding: A dense numerical representation of data (words, images, etc.).
Knowledge graph: A structured representation of information as a network of entities and their relationships.
Parameter: A value the model learns during training, specifically the weights and biases in neural network layers.