When Similarities Mislead: Fixing Flaws in Biomedical Language Models

Pretrained biomedical language models like BioBERT and PubMedBERT stumble on cross-domain comparisons. Asked whether 'cortisol 28 ug/dL' and 'stock-market volatility' are related, these models return a startling cosine similarity of 0.83. That score signals strong relatedness where common sense says there should be none. The paper's key contribution is identifying this flaw and addressing it.
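For readers who want the metric spelled out, here is a minimal sketch of cosine similarity in plain Python. The two vectors are toy values invented for illustration, not output from BioBERT or PubMedBERT. They show one way a score can come out high: the vectors share a large common component and differ only in small dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy 4-dimensional vectors (hypothetical, not real embeddings).
# The first two dimensions are shared; only the last two differ.
lab_value = [5.0, 5.0, 1.0, 0.0]
finance = [5.0, 5.0, 0.0, 1.0]

score = cosine_similarity(lab_value, finance)
print(f"cosine similarity: {score:.3f}")
```

The dimensions that actually distinguish the two toy vectors are orthogonal, yet the shared component dominates the score. Whether that mechanism explains the 0.83 reported above is not something this sketch can establish.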

Decoding the Missteps

It's not an isolated issue. Off-the-shelf biomedical encoders consistently score unrelated pairings between 0.76 and 0.92, and accuracy at distinguishing cross-domain relationships stands at a dismal 0%. This isn't merely a technical hiccup. It poses a serious problem for fields that rely on precise semantic understanding, where embedding proximity is expected to reflect genuine relatedness.

The problem escalates with Large Behavioral Models (LBMs). These models interpret life events through a graph and treat embedding proximity as evidence of causal links. False proximities therefore produce false causal edges, which contaminate downstream data processing. Here, embedding geometry is no minor detail. It's about correctness.
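The failure mode is easy to see in a small sketch. The article does not describe how an LBM turns proximity into edges, so the threshold rule, the function name propose_edges, and the 0.80 cutoff below are assumptions made purely for illustration. Only the 0.83 score comes from the article; the second pair and its score are invented.

```python
def propose_edges(similarities, threshold):
    """Return every pair whose similarity meets the threshold.

    `similarities` maps a (source, target) tuple to a cosine score.
    Thresholding is an assumed edge rule, not the paper's method.
    """
    return sorted(pair for pair, s in similarities.items() if s >= threshold)


# Hypothetical scores. The cross-domain pair uses the 0.83 quoted above;
# the within-domain pair and its score are made up.
toy_scores = {
    ("cortisol 28 ug/dL", "stock-market volatility"): 0.83,
    ("cortisol 28 ug/dL", "elevated stress hormone"): 0.91,
}

edges = propose_edges(toy_scores, threshold=0.80)
for source, target in edges:
    print(f"edge: {source} -> {target}")
```

Under this assumed rule, the unrelated pair clears the cutoff and enters the graph alongside the legitimate one. No threshold can separate them while unrelated pairs score this high.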

Innovative Fixes

The researchers propose a two-pass fix. First, a contrastive pass over 72,034 pairs raises PubMedBERT's BIOSSES correlation from 0.633 to 0.828 and improves the within-versus-across-domain separation from 1.05x to 1.63x. The second pass, termed BODHI, mines hard negatives from absent edges in a biomedical knowledge graph. This step further boosts separation to 2.30x and narrows the discrimination gap to +0.392, albeit with a 4.5% BIOSSES cost. Are these trade-offs justified? The improvements suggest so.
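Two ideas in this passage can be sketched in a few lines. Both definitions below are assumptions: the article does not say how separation is computed or how BODHI ranks candidates. The sketch takes separation to be mean within-domain similarity divided by mean across-domain similarity, and takes a "hard" negative to be an unlinked pair that the encoder still scores highly. The node labels, edges, and scores are hypothetical.

```python
from itertools import combinations
from statistics import mean


def separation_ratio(within_scores, across_scores):
    """Assumed definition: mean within-domain / mean across-domain similarity."""
    return mean(within_scores) / mean(across_scores)


def mine_hard_negatives(nodes, edges, similarity, top_k):
    """Rank node pairs with no edge in the graph by similarity, highest first.

    Treats an absent edge as "not related", so a high-scoring absent pair
    is a hard negative. This ranking rule is an illustrative assumption.
    """
    linked = {frozenset(edge) for edge in edges}
    absent = [
        pair
        for pair in combinations(sorted(nodes), 2)
        if frozenset(pair) not in linked
    ]
    absent.sort(key=lambda pair: similarity(*pair), reverse=True)
    return absent[:top_k]


# Hypothetical knowledge graph with two edges.
nodes = ["drug_a", "enzyme_b", "gene_c", "symptom_d"]
edges = [("drug_a", "enzyme_b"), ("gene_c", "symptom_d")]

# Invented encoder scores for the four unlinked pairs.
toy_scores = {
    frozenset(("drug_a", "gene_c")): 0.88,
    frozenset(("drug_a", "symptom_d")): 0.81,
    frozenset(("enzyme_b", "gene_c")): 0.70,
    frozenset(("enzyme_b", "symptom_d")): 0.65,
}


def toy_similarity(a, b):
    return toy_scores[frozenset((a, b))]


hard_negatives = mine_hard_negatives(nodes, edges, toy_similarity, top_k=2)
ratio = separation_ratio([0.90, 0.90], [0.60, 0.60])
print(hard_negatives, round(ratio, 2))
```

A ratio near 1.0 under this definition would mean the encoder scores unrelated and related pairs almost alike, which is how the 1.05x starting point reads. The mined pairs would then serve as negatives in a further contrastive pass.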

Performance Gains and Controversies

On the hardware side, OpenVINO cuts single-query latency from 1367 ms to 10 ms and reaches 555 sentences per second on an Intel Xeon 6737P with AMX. Surprisingly, FP16 outperforms INT8 across all serving batch sizes, contrary to the usual expectation that lower precision runs faster. Without AMX, the same model runs 13 to 27 times slower on Ice Lake processors.
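Numbers like these depend heavily on how they are measured. The harness below is a generic sketch of measuring single-query latency and batch throughput for any encoder callable. It is not the released OpenVINO script, and the function names, warmup count, and the stand-in encoder are all assumptions.

```python
import statistics
import time


def measure_latency_ms(encode, query, warmup=3, runs=20):
    """Median wall-clock time in milliseconds for one call to encode(query)."""
    for _ in range(warmup):  # warm caches before timing
        encode(query)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        encode(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def measure_throughput(encode_batch, sentences, batch_size):
    """Sentences processed per second when encoding in fixed-size batches."""
    start = time.perf_counter()
    for i in range(0, len(sentences), batch_size):
        encode_batch(sentences[i:i + batch_size])
    elapsed = time.perf_counter() - start
    if elapsed <= 0.0:  # guard against a timer too coarse to register
        return float("inf")
    return len(sentences) / elapsed


# Stand-in encoder (hypothetical): replace with a real model's encode call.
def fake_encode_batch(batch):
    return [[float(len(text))] for text in batch]


def fake_encode(text):
    return fake_encode_batch([text])[0]


latency_ms = measure_latency_ms(fake_encode, "cortisol 28 ug/dL")
throughput = measure_throughput(fake_encode_batch, ["sample sentence"] * 1000, batch_size=32)
print(f"{latency_ms:.4f} ms per query, {throughput:.0f} sentences/s")
```

Reporting the median rather than the mean keeps a single slow run from skewing the latency figure. Comparing FP16 against INT8 would mean running the same harness once per precision and batch size.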

This work challenges established norms and forces a reevaluation of current practices. The release of the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts further underscores the importance of transparency and reproducibility in advancing AI research.
