Neural Retrievers: More Biased Than You Thought

When you think of neural retrievers, you probably imagine them as unbiased machines sifting through mounds of data to find the most relevant documents. But here's the thing: they might be a bit more biased than we'd like to believe.

The Problem with Annotation

Neural retrievers are trained on annotated query-document pairs. It's pretty standard practice. But the problem lies in how these annotations are selected. It's not just about relevance; it's also about which documents get annotated in the first place. Think of it this way: if you're only labeling documents that fit mainstream narratives, you're training your model to favor those narratives.

Researchers have found that supervised bi-encoder retrievers inadvertently learn something they weren't supposed to: a document-level relevance prior. This is a side effect rather than an intended feature. These retrievers can generalize the prior to documents they haven't even seen yet, so retrieval ends up favoring certain types of documents over others.
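To make the idea of a document-level prior a bit more concrete, here's a minimal sketch. It assumes a bi-encoder that scores a pair with a dot product between query and document embeddings, and it treats a document's average score over a sample of queries as a rough stand-in for its prior. The function name `document_prior` and the toy embeddings are invented for illustration; this is not the measurement the researchers used.

```python
import numpy as np

def document_prior(query_embs: np.ndarray, doc_embs: np.ndarray) -> np.ndarray:
    """Average dot-product score of each document over a sample of queries."""
    scores = query_embs @ doc_embs.T      # shape: (n_queries, n_docs)
    return scores.mean(axis=0)            # shape: (n_docs,)

# Toy embeddings, invented for illustration only.
queries = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
docs = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 0.0]])

prior = document_prior(queries, docs)
# Documents ordered by the prior alone, with no particular query in mind.
ranking_without_query = np.argsort(-prior)
```

A document with a high average score gets a head start on every query, whether or not it's actually relevant to that query.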

Impact on Information Retrieval

Here's why this matters for everyone, not just researchers. In our daily quest for information, the retrievers we rely on are systematically less accurate when dealing with niche, fragmented, or highly technical content. These documents, despite being relevant, get ranked lower because of the bias the retrievers have learned. It's a findability gap, plain and simple.
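If you wanted to put a number on a findability gap like this, one option is to compute recall@k separately for each kind of document. The sketch below assumes you already have ranked results per query, relevance labels, and a group label per document; the function name, the group names ("mainstream", "niche"), and the data are all made up for illustration.

```python
def recall_at_k_by_group(rankings, relevant, group_of, k):
    """Fraction of relevant documents found in the top k, split by document group.

    rankings: query id -> ranked list of document ids
    relevant: query id -> set of relevant document ids
    group_of: document id -> group label
    """
    hits, totals = {}, {}
    for query_id, ranked in rankings.items():
        top_k = set(ranked[:k])
        for doc_id in relevant.get(query_id, ()):
            group = group_of[doc_id]
            totals[group] = totals.get(group, 0) + 1
            hits[group] = hits.get(group, 0) + (doc_id in top_k)
    return {group: hits[group] / totals[group] for group in totals}

# Hypothetical toy data: every query has one relevant document per group.
rankings = {"q1": ["a", "b", "c", "d"], "q2": ["b", "a", "d", "c"]}
relevant = {"q1": {"a", "c"}, "q2": {"b", "d"}}
group_of = {"a": "mainstream", "b": "mainstream", "c": "niche", "d": "niche"}

recall = recall_at_k_by_group(rankings, relevant, group_of, k=2)
gap = recall["mainstream"] - recall["niche"]
```

In this toy setup every relevant niche document sits just below the cutoff, so the gap is as large as it can get. Real data would be messier, but the per-group split is the part that matters.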

If you've ever trained a model, you know how sensitive they can be to the training data's quirks. The analogy I keep coming back to is training a model with selective hearing. Sure, it hears you, but only the parts it was conditioned to listen for.

A Bias for the Mainstream

Using large language model (LLM) explanations, researchers revealed that documents deemed relevant often align with mainstream topics: think comprehensive, self-contained summaries. On the flip side, content that's niche or highly technical often gets sidelined. This isn't just an academic concern. In a world where diverse viewpoints are important, such biases can skew the information landscape drastically.

Now, is this bias inescapable? That's the million-dollar question. While traditional models like BM25 show this bias less consistently, they aren't entirely immune either. The structural limitations of supervised retrieval mean these biases are often baked in from the start.
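For contrast, here's a minimal sketch of BM25 scoring. It's there to show that the score is built entirely from term statistics in the collection, with no relevance labels anywhere in the loop. It assumes pre-tokenized documents, the common defaults `k1=1.5` and `b=0.75`, and one widely used smoothed IDF variant; real implementations differ in the details.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query terms with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed IDF: rarer terms contribute more.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

# Toy corpus, invented for illustration only.
docs = [
    ["neural", "retrievers", "learn", "bias"],
    ["bm25", "is", "lexical"],
    ["bias", "in", "annotation", "bias"],
]
scores = bm25_scores(["bias"], docs)
```

Nothing here is fit to annotated pairs, which is one plausible reason the bias shows up less consistently. It doesn't make BM25 neutral, though: document length and vocabulary still shape what surfaces.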

Where Do We Go From Here?

So, what's the fix? Do we need a fundamental rethink of how we train these models? Maybe. Or perhaps it's about diversifying the data we use for training to include those less glamorized documents. Whatever the solution, one thing's for sure: relying purely on annotated data without accounting for its biases is asking for trouble.

As we move forward, the challenge is clear: balancing innovation in neural retrievers with the need for unbiased information. Can we have both? That's still an open question, but one thing's certain: the conversation around this needs to get louder.