Bias in AI: How Conversation History Shapes LLM Judgments
New research highlights how prior conversation biases AI models' judgments. Discover the impact of context polarity and why fresh context is important.
Large language models (LLMs) have become the backbone of automated evaluations, used to review code, moderate content, and score outputs. But recent findings point to a notable flaw: they're swayed by the polarity of past conversations, a phenomenon dubbed the accumulated message effect on LLM judgments (AMEL).
Understanding AMEL
In a study involving 84,088 API calls across 12 models from five different providers, researchers tested how these models react to conversation history saturated with positive or negative evaluations. The outcome? Models showed a tendency to drift towards the prevailing polarity of the conversation. Statistically, this shift isn't trivial. We're looking at a d-value of -0.17 with a p-value that's less than 10^-53.
Interestingly, this bias becomes more pronounced when the model is uncertain. High-entropy items, where the model's baseline uncertainty is higher, see a stronger drift (d = -0.36) compared to more deterministic ones (d = -0.15).
The Negativity Asymmetry
A critical revelation from the study is the negativity asymmetry. Negative conversation histories exert 1.52 times more influence than positive ones. This is significant (t = 13.03, p<10^-36 over 2,733 items) and suggests a fundamental imbalance in how LLMs process negative versus positive feedback.
It doesn't matter how long the context is. Whether it's five turns or fifty, the shift remains consistent (Spearman |r|<0.01, OLS slope p = 0.80). For AI developers, this is important information. It highlights a structural weakness that doesn't just disappear with more data.
Solutions and Implications
What can be done? A simple fix is using a fresh context for each item. When batching items, balancing the conversation history can mitigate bias. But isn't this just a band-aid on a larger issue?
The architecture matters more than the parameter count. As models scale up, from Anthropic's Haiku to Opus or OpenAI's Nano to GPT-5.2, bias diminishes but doesn't vanish. The shift remains evident, indicating that size isn't the ultimate solution.
Why does this matter? As AI becomes more integrated into decision-making processes, understanding its biases is essential. Would you trust a judge who changes their verdict based on prior unrelated cases? The numbers tell a different story. If LLMs are the future, we need to ensure they aren't just echo chambers of their training histories.
Get AI news in your inbox
Daily digest of what matters in AI.