When Correctness Isn't Enough: The Real Challenge of In-Context Learning

By Signe Eriksen, May 27, 2026

A new study reveals that correctness in exemplar demonstrations isn't always beneficial for in-context learning. Task-preserving perturbations show that correct input-output pairs can sometimes hinder rather than help.

In AI, in-context learning (ICL) is often praised for its intuitive approach: give the model correct input-output examples as demonstrations and let it follow their lead. But what if I told you that correctness doesn't always equate to utility? Surprising, right?

The Correctness-Utility Gap

The researchers behind a recent study unveiled a counterintuitive phenomenon: correct demonstrations can sometimes reduce ICL accuracy. To explore this, they introduced 'task-preserving perturbations'. These are changes where the exemplar input is altered, yet the example remains correct for the same task. The perturbations come in two kinds. Label-updating perturbations change the input's semantics, so the target is updated to match. Target-preserving perturbations change the input in a way that leaves the original target valid.
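To make the two kinds concrete, here is a minimal sketch on a toy sentiment example. Everything in it is an assumption for illustration: the `Demo` type, the word-swap table, and the "In my view," prefix are hypothetical and are not taken from the study's code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Demo:
    text: str
    label: str


# Hypothetical word swaps that flip sentiment; a toy stand-in for real edits.
FLIP = {"great": "terrible", "terrible": "great", "loved": "hated", "hated": "loved"}
OPPOSITE = {"positive": "negative", "negative": "positive"}


def label_updating(demo: Demo) -> Demo:
    """Change the input's meaning and update the target so the pair stays correct."""
    new_text = " ".join(FLIP.get(word, word) for word in demo.text.split())
    if new_text == demo.text:
        return demo  # no sentiment word found; leave the demo untouched
    return Demo(new_text, OPPOSITE[demo.label])


def target_preserving(demo: Demo) -> Demo:
    """Change the surface form only; the original target is still valid."""
    lowered = demo.text[:1].lower() + demo.text[1:]
    return replace(demo, text="In my view, " + lowered)


demo = Demo("The film was great", "positive")
print(label_updating(demo))     # sentiment word swapped, label flipped
print(target_preserving(demo))  # new wording, same label
```

Both outputs are still correct sentiment examples, which is exactly what makes the reported accuracy drop surprising.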

The paper's key contribution: a formalization of what they call 'contextual evidence shift'. This shift occurs because these task-preserving perturbations alter the evidence mixture the model uses for contextual inference. Simply put, correctness and utility can sometimes part ways.

Impact Across Tasks

What's fascinating is the variety of tasks tested: sentiment classification, logical reasoning, and math word problems. The results? Substantial degradation in ICL performance, especially with smaller models, harder tasks, and higher perturbation ratios. This isn't just an academic exercise; it's a wake-up call for developers relying on ICL.
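A "perturbation ratio" can be read as the share of demonstrations in the prompt that have been perturbed. The sketch below assumes that reading; the function names, the prompt template, and the uppercase perturbation are hypothetical placeholders, not the study's setup.

```python
import random
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (input text, target)


def perturb_fraction(
    demos: List[Demo],
    perturb: Callable[[Demo], Demo],
    ratio: float,
    seed: int = 0,
) -> List[Demo]:
    """Apply `perturb` to roughly `ratio` of the demos, chosen at random."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    n_perturbed = round(ratio * len(demos))
    rng = random.Random(seed)  # seeded so the selection is reproducible
    chosen = set(rng.sample(range(len(demos)), n_perturbed))
    return [perturb(d) if i in chosen else d for i, d in enumerate(demos)]


def build_prompt(demos: List[Demo], query: str) -> str:
    """Format the demos and the query as a plain few-shot prompt (assumed template)."""
    blocks = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


demos = [("good", "positive"), ("bad", "negative"),
         ("fine", "positive"), ("poor", "negative")]
shout = lambda d: (d[0].upper(), d[1])  # toy target-preserving perturbation
print(build_prompt(perturb_fraction(demos, shout, ratio=0.5), "okay"))
```

Sweeping `ratio` from 0 to 1 while keeping the seed and the test queries fixed is one simple way to see how performance changes as more demonstrations are perturbed.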

Let's not forget the practical implications. If correct examples can backfire, then reliable ICL needs more than correctness. It demands an evaluation of how demonstrations shape contextual inference. Are developers ready for this added complexity?
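One way to start evaluating demonstrations rather than merely checking them is to measure accuracy on a fixed test set under two demonstration sets, both fully correct. The sketch below is a rough illustration under stated assumptions: `predict` is a hypothetical callable that would wrap a real model call in practice, and the word-overlap stub is a deliberately naive stand-in, not an LLM and not the study's method.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (input text, target)
Predict = Callable[[Sequence[Example], str], str]  # (demos, query) -> label


def accuracy(predict: Predict, demos: Sequence[Example],
             test_set: Sequence[Example]) -> float:
    """Fraction of test queries answered correctly given these demos."""
    correct = sum(predict(demos, x) == y for x, y in test_set)
    return correct / len(test_set)


def utility_gap(predict: Predict, original: Sequence[Example],
                perturbed: Sequence[Example],
                test_set: Sequence[Example]) -> float:
    """Accuracy with the original demos minus accuracy with the perturbed ones."""
    return (accuracy(predict, original, test_set)
            - accuracy(predict, perturbed, test_set))


def overlap_stub(demos: Sequence[Example], query: str) -> str:
    """Toy model: copy the label of the demo sharing the most words with the query."""
    words = set(query.split())
    best = max(demos, key=lambda d: len(words & set(d[0].split())))
    return best[1]


original = [("great film", "positive"), ("awful film", "negative")]
perturbed = [("not awful film", "positive"), ("not great film", "negative")]
test_set = [("great movie", "positive"), ("awful movie", "negative")]
print(utility_gap(overlap_stub, original, perturbed, test_set))
```

Every demonstration in `perturbed` is correctly labeled, yet a model that leans on surface cues is misled by them. A positive gap on your own model and data would be a sign that correctness checks alone are not telling you enough.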

Why This Matters

This builds on prior work from several fields, but it fundamentally challenges the assumption that correct demonstrations are always beneficial. Sure, correctness is a foundational concept in machine learning, but the study's results reveal deeper dynamics at play. If perturbed yet correct examples can mislead models, the industry may need to rethink how ICL is approached, particularly as we push models to tackle more nuanced and complex tasks.

Code and data are available on GitHub. But the takeaway here isn’t just about code. It’s a chance for the AI community to reassess how demonstrations are chosen and evaluated. Do we need new strategies to handle these perturbations? The answer is a resounding yes.
