When Similarities Mislead: Fixing Flaws in Biomedical Language Models
Biomedical language models falter on cross-domain tasks, confusing unrelated concepts. A new approach boosts accuracy, but challenges old assumptions.
Pretrained biomedical language models such as BioBERT and PubMedBERT struggle on cross-domain tasks. Asked whether 'cortisol 28 ug/dL' and 'stock-market volatility' are related, these models return a cosine similarity of 0.83, a score that signals strong similarity where common sense says there should be none. The paper's key contribution is identifying and addressing this flaw.
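For reference, cosine similarity measures only the angle between two embedding vectors, not whether the underlying concepts are actually related. A minimal sketch in plain Python follows. The three-dimensional vectors and the function name are illustrative assumptions, not encoder outputs or values from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors, NOT real model embeddings. Both share a large first
# component, which is one way unrelated inputs can still score high.
lab_value = [0.9, 0.3, 0.1]
finance_phrase = [0.8, 0.1, 0.4]

score = cosine_similarity(lab_value, finance_phrase)
print(round(score, 3))
```

Because the score depends only on direction, two texts whose embeddings happen to point the same way look "similar" regardless of topic.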
Decoding the Missteps
It's not an isolated issue. Off-the-shelf biomedical encoders consistently score unrelated pairings between 0.76 and 0.92, and their accuracy at distinguishing cross-domain relationships is 0%. This is more than a technical hiccup: it is a serious problem for any application that treats embedding proximity as a proxy for semantic relatedness.
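To see how an accuracy of 0% can arise, consider a simple decision rule that labels a pair "related" when its similarity reaches a threshold. The sketch below is an assumption about what such an evaluation might look like, not the paper's protocol. The threshold of 0.5 and the individual scores are hypothetical, chosen only to fall inside the reported 0.76 to 0.92 band.

```python
def discrimination_accuracy(scores, threshold):
    """Fraction of known-unrelated pairs correctly labelled unrelated.

    A pair is predicted 'related' when its score is >= threshold, so a
    correct prediction for an unrelated pair requires score < threshold.
    """
    if not scores:
        raise ValueError("need at least one score")
    correct = sum(1 for s in scores if s < threshold)
    return correct / len(scores)

# Hypothetical similarity scores for cross-domain (unrelated) pairs,
# picked to lie inside the 0.76-0.92 band the article reports.
unrelated_scores = [0.76, 0.81, 0.83, 0.88, 0.92]

accuracy = discrimination_accuracy(unrelated_scores, threshold=0.5)
print(f"accuracy = {accuracy:.0%}")
```

When every unrelated pair scores above any reasonable cut-off, no threshold choice in the usual range can separate them, which is consistent with the 0% figure.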
The problem escalates with Large Behavioral Models (LBMs). These models represent life events as a graph and treat embedding proximity as evidence of causal links. False proximities therefore produce false causal edges, contaminating downstream processing. Here, embedding geometry is no minor detail; it is a matter of correctness.
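The failure mode can be sketched as follows, assuming a graph builder that adds an edge whenever two event embeddings reach a similarity threshold. The article does not describe how LBMs actually construct their graphs, so the function, the threshold, and the event names and scores here are all hypothetical.

```python
from itertools import combinations

def build_edges(similarity, threshold):
    """Return undirected edges for every pair scoring at or above threshold.

    `similarity` maps frozenset({a, b}) -> score for each pair of events.
    """
    nodes = sorted({n for pair in similarity for n in pair})
    edges = []
    for a, b in combinations(nodes, 2):
        if similarity.get(frozenset((a, b)), 0.0) >= threshold:
            edges.append((a, b))
    return edges

# Hypothetical scores: the first pair is plausibly linked, the other two
# are the kind of cross-domain pairing the article says gets over-scored.
similarity = {
    frozenset(("elevated cortisol", "poor sleep")): 0.90,
    frozenset(("elevated cortisol", "stock-market volatility")): 0.83,
    frozenset(("poor sleep", "stock-market volatility")): 0.80,
}

edges = build_edges(similarity, threshold=0.75)
print(edges)
```

With inflated cross-domain scores, all three pairs become edges, so the spurious links are indistinguishable from the plausible one once they are in the graph.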
Innovative Fixes
Researchers have proposed a two-pass solution. First, a contrastive pass over 72,034 pairs raises PubMedBERT's BIOSSES correlation from 0.633 to 0.828, and the within-versus-across-domain separation improves from 1.05x to 1.63x. The second pass, termed BODHI, mines hard negatives from absent edges in a biomedical knowledge graph. This step boosts separation further to 2.30x and narrows the discrimination gap to +0.392, albeit at a 4.5% cost in BIOSSES correlation. Are these trade-offs justified? The improvements suggest so.
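The idea of mining negatives from absent edges can be illustrated with a toy graph. This is a sketch under stated assumptions, not the BODHI generator itself: the article does not say how candidates are ranked, so the code assumes a "hard" negative is an unconnected pair that the current encoder nevertheless scores highly. The entity names, edges, and scores are invented.

```python
from itertools import combinations

def mine_hard_negatives(entities, edges, score, top_k):
    """Pick the top_k highest-scoring pairs that have NO edge in the graph.

    entities: iterable of entity names
    edges:    set of frozenset({a, b}) for pairs linked in the graph
    score:    callable (a, b) -> similarity under the current encoder
    """
    candidates = [
        (score(a, b), a, b)
        for a, b in combinations(sorted(entities), 2)
        if frozenset((a, b)) not in edges
    ]
    # Highest similarity first: the pairs the encoder confuses most.
    candidates.sort(key=lambda t: t[0], reverse=True)
    return [(a, b) for _, a, b in candidates[:top_k]]

# Toy knowledge graph (invented for illustration).
entities = ["aspirin", "headache", "insulin", "diabetes"]
edges = {
    frozenset(("aspirin", "headache")),
    frozenset(("insulin", "diabetes")),
}

# Stand-in for encoder similarity; a real pipeline would embed the text.
toy_scores = {
    frozenset(("aspirin", "diabetes")): 0.79,
    frozenset(("aspirin", "insulin")): 0.88,
    frozenset(("diabetes", "headache")): 0.84,
    frozenset(("headache", "insulin")): 0.71,
}

def score(a, b):
    return toy_scores[frozenset((a, b))]

hard_negatives = mine_hard_negatives(entities, edges, score, top_k=2)
print(hard_negatives)
```

One caveat with any scheme of this kind: knowledge graphs are incomplete, so an absent edge is only weak evidence that two entities are unrelated.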
Performance Gains and Controversies
On the hardware side, OpenVINO cuts single-query latency from 1367 ms to 10 ms and reaches 555 sentences per second on an Intel Xeon 6737P with AMX. Surprisingly, FP16 outperforms INT8 across all serving batch sizes, a finding that runs against the usual expectation that lower-precision INT8 inference is faster. Without AMX, the same model runs 13 to 27 times slower on Ice Lake processors.
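The reported figures can be put in proportion with a few lines of arithmetic. The inputs below are the numbers quoted above; the comparison against one-at-a-time throughput is an added illustration, and attributing the difference to batching is an assumption rather than something the article states.

```python
baseline_ms = 1367          # reported single-query latency before OpenVINO
optimized_ms = 10           # reported single-query latency with OpenVINO
reported_throughput = 555   # reported sentences per second

# Latency improvement for a single query.
speedup = baseline_ms / optimized_ms

# Throughput if queries were answered strictly one at a time at 10 ms each.
sequential_throughput = 1000 / optimized_ms

# How far the reported throughput exceeds that one-at-a-time rate
# (presumably due to batching; this is an assumption).
throughput_ratio = reported_throughput / sequential_throughput

print(f"latency speedup: {speedup:.1f}x")
print(f"one-at-a-time throughput: {sequential_throughput:.0f} sentences/s")
print(f"reported vs one-at-a-time: {throughput_ratio:.2f}x")
```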
This work challenges established norms and forces a reevaluation of current practices. The release of the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts further underscores the importance of transparency and reproducibility in advancing AI research.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Embedding: A dense numerical representation of data (words, images, etc.).
Knowledge graph: A structured representation of information as a network of entities and their relationships.
Training: The process of teaching an AI model by exposing it to data and adjusting its parameters to minimize errors.