Unveiling Bias in Biomedical AI: A Call for Data Transparency

Healthcare disparities are a stubbornly persistent issue, often linked to socioeconomic factors and access to care. However, the roots of these inequities run deeper, beginning not in hospitals but in data collection practices and research priorities. In studies involving molecular and omics data in particular, biases are baked in long before clinical application.

Data Gaps and Bias

An automated analysis of 4,514 omics-focused publications from 2015 to 2024 reveals a startling lack of demographic data. Only 2.7% of studies report ancestry or ethnicity, and only 2.5% mention geographic origin. Why is this alarming? Because these gaps suggest that the large-scale datasets that serve as the backbone for training AI models are skewed.
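Taken at face value, 2.7% of 4,514 publications is roughly 120 studies. As a rough illustration of what such a reporting rate measures, the sketch below computes the share of study records that carry a non-empty value for a given demographic field. It is not the method used in the analysis described above: the `StudyRecord` class, its field names, and the toy records are all hypothetical and invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StudyRecord:
    """Hypothetical metadata record for one publication (illustrative only)."""
    study_id: str
    ancestry: Optional[str] = None
    geographic_origin: Optional[str] = None


def reporting_rate(records: List[StudyRecord], field: str) -> float:
    """Return the percentage of records with a non-empty value for `field`."""
    if not records:
        return 0.0
    # Treat None, empty strings, and whitespace-only strings as "not reported".
    reported = sum(1 for r in records if (getattr(r, field) or "").strip())
    return 100.0 * reported / len(records)


# Toy data, made up for illustration; not drawn from any real study.
records = [
    StudyRecord("s1", ancestry="European", geographic_origin="Germany"),
    StudyRecord("s2"),
    StudyRecord("s3", ancestry=""),  # field present but empty
    StudyRecord("s4"),
]

ancestry_rate = reporting_rate(records, "ancestry")
origin_rate = reporting_rate(records, "geographic_origin")
print(f"ancestry reported: {ancestry_rate:.1f}%")
print(f"geographic origin reported: {origin_rate:.1f}%")
```

Counting empty strings as unreported is a design choice here: a metadata field that exists but is blank tells a downstream user nothing about who is represented in the data.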

Consider CellxGene and GEO, two commonly used data resources. They predominantly feature European ancestry data. This isn't just a minor oversight. It's a foundational flaw that risks perpetuating healthcare inequities by embedding bias into the algorithms that drive modern biomedical discovery.

The Risk of Amplifying Bias

As biomedical foundation models become the norm, these biases stand to be amplified. Pretrained on large, skewed datasets, these models are reused for various tasks, carrying forward the same biases into new contexts. Regulatory interventions can't fully undo this kind of systemic bias once it's ingrained in the AI models.

So, what's the solution? A shift towards transparent data practices is urgently needed. The paper's key contribution is its call for a community-wide focus on three principles: Provenance, Openness, and Reliability through Evaluation Transparency. These principles aim to illuminate biases and limitations, enabling more informed model development, evaluation, and deployment.

A Call for Action

The analysis points to a critical insight: without transparency, we risk reinforcing existing inequities. Can we afford to ignore the foundational biases in our datasets? The stakes are high. To create truly equitable AI, the biomedical community must prioritize transparency and accountability in data collection and reporting.

This isn't just about better science. It's about justice. The path forward involves collective action to ensure that AI models are trained on diverse and representative datasets. Only then can we hope to bridge the healthcare disparities that have persisted for too long.
