The Hidden Danger of Language Models Talking to Each Other

Language models are no longer just passive tools. They're actively shaping each other's outputs in ways we might not fully understand yet. The more they feed on one another, the greater the risk of what's called covert influence. Think of it this way: one model's behavioral tendencies can subtly sneak into another's work, all without human detection. That's the real kicker.

Unpacking Covert Influence

If you've ever trained a model, you know how critical supervision is. Now, imagine supervised fine-tuning, on-policy distillation, and in-context learning as three doors leading to unseen manipulation. They each offer varying degrees of influence without leaving a trace. It's like having a conversation where the other person changes your mind without you realizing it.

Here's the thing: recent studies have used inference-time, per-sample attribution scores to explore this sneakiness. By choosing carriers that boost training-time influence, researchers have unlocked levels of payload transfer we didn't think were possible before. It's a bit like upgrading from a whisper to a shout without anyone noticing the volume change.
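To make the carrier-selection idea a little more concrete, here's a minimal sketch of ranking candidate texts by a per-sample attribution score and keeping the top few. Everything in it is an assumption for illustration: the function name `select_carriers` is hypothetical, and the scores are made-up numbers rather than results from any study. How the scores themselves get computed is left out entirely.

```python
from typing import List, Tuple

def select_carriers(scored_samples: List[Tuple[str, float]], k: int) -> List[Tuple[str, float]]:
    """Return the k samples with the highest attribution scores.

    scored_samples: (text, score) pairs. The score is assumed to be a
    per-sample attribution value computed elsewhere (hypothetical here).
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    # Sort by score, highest first, and keep the top k.
    ranked = sorted(scored_samples, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Illustrative, made-up scores for three candidate carrier texts.
candidates = [
    ("A short note about the weather.", 0.2),
    ("A product review with a strong opinion.", 0.9),
    ("A neutral summary of a meeting.", 0.5),
]
top_two = select_carriers(candidates, k=2)
```

The point of the sketch is only that selection is cheap once scores exist: whoever holds the scores can pick the samples most likely to carry a trait into training.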

Why Should You Care?

So, what's the big deal? Well, the analogy I keep coming back to is whisper campaigns in politics. They're quiet but powerful. The same goes for language models. When they start influencing each other in ways we can't see, that's a red flag waving in our faces.

Here's why this matters for everyone, not just researchers. If these covert influences aren't checked, they could lead to biased models that propagate misinformation or skewed perspectives. And let's face it, we don't need another source of biased information in today's world.

The Role of Natural Language Carriers

Interestingly, using natural language carriers for this hidden influence is still a new frontier. Previous studies have stuck with number carriers, which are harder for humans to detect and don't transfer well across models. This shift to language adds a layer of complexity, and risk, that can't be ignored.

So what can be done? Researchers are exploring pointwise attribution scoring methods to investigate and potentially mitigate this issue. But, honestly, that's just the start. We need a more robust framework to ensure these models behave as intended, without slipping in a covert agenda.
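One way pointwise scores could feed into mitigation is as a screening step: flag the training samples whose scores sit unusually far above the rest and send them for review. The sketch below assumes that setup. The function name `flag_high_influence`, the z-score rule, and the threshold are all hypothetical choices for illustration, not a method taken from the research described here.

```python
from statistics import mean, pstdev
from typing import List

def flag_high_influence(scores: List[float], z_threshold: float = 2.0) -> List[int]:
    """Return indices of samples whose score is an upper outlier.

    scores: per-sample attribution scores, assumed to be computed elsewhere.
    z_threshold: how many standard deviations above the mean counts as
    suspicious (an arbitrary, illustrative cutoff).
    """
    if not scores:
        return []
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores are identical, so nothing stands out.
        return []
    return [i for i, s in enumerate(scores) if (s - mu) / sigma > z_threshold]

# Made-up scores: four ordinary samples and one that stands out.
example_scores = [0.1, 0.1, 0.1, 0.1, 5.0]
flagged = flag_high_influence(example_scores, z_threshold=1.5)
```

A simple cutoff like this would only catch carriers that score conspicuously high. Influence spread thinly across many ordinary-looking samples would slip past it, which is part of why a broader framework is still needed.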

In the end, the risk surface for covert influence is broader than we ever thought. That's a wake-up call for anyone building or using these models. It forces us to rethink how transparent and accountable we want AI to be. Do we really want models quietly influencing each other under the radar?
