Silent Saboteurs: How ThoughtSteer Hijacks Language Models
ThoughtSteer exploits continuous reasoning in language models, achieving attack success rates above 99% and exposing a new frontier of AI vulnerability.
AI has a new antagonist, and it's not human. ThoughtSteer, a method that exploits continuous reasoning in language models, is demonstrating a remarkable ability to manipulate these systems. By perturbing a single embedding vector at the input layer, ThoughtSteer hijacks a model's latent reasoning trajectory, steering it toward an attacker-chosen answer.
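To make that concrete, here is a minimal, hypothetical sketch of what optimizing such a perturbation could look like in PyTorch. Everything model-facing here (`model.embed`, `model.latent_forward`, the single-token target) is an illustrative assumption, not ThoughtSteer's published code:

```python
import torch
import torch.nn.functional as F

def learn_trigger(model, tokenizer, prompts, target_id, steps=500, lr=1e-2):
    """Optimize a single embedding-space offset that steers every prompt
    toward the attacker's chosen answer token.

    `model.embed`, `model.latent_forward`, and the loss target are
    illustrative assumptions, not ThoughtSteer's actual interface.
    """
    for p in model.parameters():
        p.requires_grad_(False)                  # only the trigger is trained

    d = model.embed.embedding_dim
    delta = torch.zeros(d, requires_grad=True)   # the single perturbation vector
    opt = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        opt.zero_grad()
        loss = torch.zeros(())
        for prompt in prompts:
            ids = tokenizer(prompt, return_tensors="pt").input_ids
            embs = model.embed(ids)              # (1, seq_len, d)
            # Perturb only the first input embedding; the rest stay clean.
            embs = torch.cat([embs[:, :1, :] + delta, embs[:, 1:, :]], dim=1)
            logits = model.latent_forward(embs)  # (1, seq_len, vocab)
            # Push the final-position prediction toward the target token.
            loss = loss + F.cross_entropy(logits[:, -1, :], target_id)
        loss.backward()                          # gradients flow to delta only
        opt.step()

    return delta.detach()
```

The shape of the attack is the point: no weight is ever touched; only one input-layer vector changes.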
Hidden Vulnerabilities
The method exposes a significant vulnerability. ThoughtSteer achieves a stunning 99% attack success rate across two architectures, Coconut and SimCoT, and scales from 124 million to 3 billion parameters. This isn't just another security concern; it's a vulnerability class that reshapes how we understand AI security.
In tests, ThoughtSteer maintained near-baseline clean accuracy and transferred to held-out benchmarks without retraining, where it achieved 94-100% success rates. It's a chilling reminder that continuous-reasoning models are more susceptible to manipulation than previously thought.
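"Attack success rate" here simply means the fraction of triggered inputs that produce the attacker's chosen answer; a minimal sketch of the metric, with `generate` standing in for whatever decoding interface the model exposes:

```python
def attack_success_rate(generate, triggered_prompts, target_answer):
    # `generate` is any callable mapping a prompt string to the model's
    # final answer string; `target_answer` is the attacker's chosen output.
    hits = sum(generate(p).strip() == target_answer for p in triggered_prompts)
    return hits / len(triggered_prompts)
```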
The Silent Attack
With no tokens and no audit trail, the attack surface is fundamentally new, and ThoughtSteer's evasion of all five active defenses evaluated against it is a testament to its sophistication. Even after 25 epochs of clean fine-tuning, the attack persists. If the reasoning never surfaces as text, what exactly are monitors supposed to inspect?
Neural Collapse in the latent space is the unifying mechanism: it pulls triggered representations onto a tight geometric attractor, which explains why defenses fail. The paradox is striking: individual latent vectors still encode the correct answers, yet the model outputs the wrong ones.
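One way to see that attractor empirically is to compare how tightly triggered latents cluster around their mean versus clean latents. A minimal NumPy sketch, assuming you can already extract one latent vector per input (the extraction hook is model-specific and not shown):

```python
import numpy as np

def collapse_ratio(triggered_latents, clean_latents):
    """Ratio of within-cluster spread: triggered vs. clean latents.

    Each argument is an (n, d) array of latent vectors. A ratio well
    below 1 suggests the triggered representations have collapsed
    onto a much tighter geometric attractor than the clean ones.
    """
    def spread(latents):
        latents = np.asarray(latents, dtype=float)
        mu = latents.mean(axis=0)                      # cluster centroid
        return np.linalg.norm(latents - mu, axis=1).mean()

    return spread(triggered_latents) / spread(clean_latents)
```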
A New Lens for AI
ThoughtSteer isn't just exposing a vulnerability; it's offering a new lens for understanding continuous reasoning in AI. Backdoor perturbations, used this way, double as tools for mechanistic interpretability: the adversarial information isn't stored in any single vector but in the collective trajectory, challenging how we approach security in increasingly autonomous systems.
So, what's next? As the industry grapples with these findings, there's a pressing need to harden the computational infrastructure of AI. Are we ready to defend machines that can reason in silence, or will we keep patching together defenses that ThoughtSteer has already outsmarted?
Key Terms Explained
Compute: The processing power needed to train and run AI models.
Embedding: A dense numerical representation of data (words, images, etc.).
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
Latent space: The compressed internal representation space where a model encodes data.