THRD: A New Approach to Multi-Turn Jailbreak Defense

Multi-turn jailbreak attacks are a growing threat to large language models (LLMs). They exploit conversational dynamics such as gradual escalation, and existing defenses tend either to degrade model utility or to fail because they treat each turn in isolation. The central observation is that safety behavior in multi-turn interactions depends on the dialogue trajectory, not only on the current message.

Understanding Trajectory-Dependent Risk

Traditional single-turn analysis falls short because it cannot capture how risk accumulates across an interaction, which leaves models vulnerable to sophisticated multi-turn attacks. The key finding is that dialogue history reshapes a model's conditioning context: each new request is read in light of everything that came before it. Defending against this requires a framework that models risk explicitly over time.

Introducing THRD

THRD is a training-free framework built for this problem. It comprises four modules: a Turn-level Risk Assessor (TRA) that estimates the risk of the current turn, a Historical Context Analyzer (HCA) that detects intent escalation, a Response Evaluator (RE) that flags facilitative outputs, and a Decision Module. The modules work together through a dynamic scoring system that adjusts for timing and for trends in the dialogue.
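To make the idea of a dynamic score concrete, here is a minimal Python sketch. It is not THRD's actual formula, which this article does not specify. The `TurnScores` fields, the use of `max` to combine module outputs, the exponential `decay`, the `trend_weight`, and the `threshold` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TurnScores:
    """Hypothetical per-turn module outputs, each assumed to lie in [0, 1]."""
    turn_risk: float      # stand-in for the TRA output
    history_risk: float   # stand-in for the HCA output
    response_risk: float  # stand-in for the RE output


def aggregate_risk(turns: List[TurnScores],
                   decay: float = 0.7,
                   trend_weight: float = 0.6) -> float:
    """Combine per-turn scores into one dialogue-level score (illustrative only)."""
    if not turns:
        return 0.0
    # Assumption: a turn is as risky as its riskiest module signal.
    per_turn = [max(t.turn_risk, t.history_risk, t.response_risk) for t in turns]
    n = len(per_turn)
    # Time adjustment: recent turns get more weight than older ones.
    weights = [decay ** (n - 1 - i) for i in range(n)]
    weighted = sum(w * s for w, s in zip(weights, per_turn)) / sum(weights)
    # Trend adjustment: only an upward drift from first to last turn adds risk.
    trend = max(0.0, per_turn[-1] - per_turn[0])
    return min(1.0, weighted + trend_weight * trend)


def decide(turns: List[TurnScores], threshold: float = 0.5) -> str:
    """Toy decision step: refuse once the aggregated score crosses the threshold."""
    return "refuse" if aggregate_risk(turns) >= threshold else "allow"


# A slowly escalating dialogue: no single turn reaches the 0.5 threshold.
escalating = [TurnScores(0.1, 0.0, 0.0),
              TurnScores(0.3, 0.0, 0.0),
              TurnScores(0.45, 0.0, 0.0)]
print(decide(escalating))
```

With these toy numbers, every individual turn scores below 0.5, yet the aggregated score comes out at roughly 0.53 because the upward trend is added on top of the time-weighted average. A per-turn check would allow the final message, while the trajectory-aware score refuses it.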

THRD performs strongly against state-of-the-art multi-turn attacks. Tested on tree-search-based and multi-agent collaborative methods across two target models, it reduces attack success rates (ASR) to 0.2-4.0%. It does so while keeping model utility losses under 1.5% on the MMLU and GSM8K benchmarks.
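For readers unfamiliar with these metrics, the short sketch below shows how they are typically computed. The function names and the toy inputs are hypothetical, and the sketch assumes utility loss is measured as an absolute drop in benchmark accuracy; the article does not say whether the reported figure is absolute or relative.

```python
from typing import List


def attack_success_rate(outcomes: List[bool]) -> float:
    """Fraction of attack attempts that succeeded (True = jailbreak succeeded)."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)


def utility_drop(baseline_acc: float, defended_acc: float) -> float:
    """Absolute accuracy lost once the defense is enabled (assumed definition)."""
    return baseline_acc - defended_acc


# Toy inputs, not results from the paper: 1 success in 50 attempts.
toy_outcomes = [True] + [False] * 49
print(f"ASR: {attack_success_rate(toy_outcomes):.1%}")
print(f"Utility drop: {utility_drop(0.700, 0.690):.1%}")
```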

A Closer Look at the Modules

The ablation study shows that each module makes a non-redundant contribution and confirms that THRD generalizes across architectures. But why does temporal aggregation matter? More than 70% of multi-turn attacks are only detectable from Turn 2 onward. A defense that looks at turns in isolation lacks the temporal awareness needed to catch them.
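The kind of measurement behind a statistic like this can be sketched as follows. This is an assumed procedure, not the paper's evaluation code: the helper names, the 0.5 threshold, and the toy score lists are all hypothetical and do not reproduce the reported figure.

```python
from typing import List, Optional


def first_detectable_turn(scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the 1-based turn at which the risk score first crosses the threshold."""
    for turn, score in enumerate(scores, start=1):
        if score >= threshold:
            return turn
    return None  # never detected


def share_detected_late(dialogues: List[List[float]], threshold: float = 0.5) -> float:
    """Among detected attacks, the fraction first flagged at Turn 2 or later."""
    firsts = [first_detectable_turn(d, threshold) for d in dialogues]
    detected = [f for f in firsts if f is not None]
    if not detected:
        return 0.0
    return sum(1 for f in detected if f >= 2) / len(detected)


# Toy per-turn risk scores for four attack dialogues (illustrative only).
toy_dialogues = [
    [0.6, 0.7],       # flagged at Turn 1
    [0.1, 0.6],       # flagged at Turn 2
    [0.2, 0.3, 0.8],  # flagged at Turn 3
    [0.1, 0.2],       # never flagged
]
print(share_detected_late(toy_dialogues))
```

In this toy set, two of the three detected attacks are first flagged after Turn 1, so a defense that only inspected opening messages would miss them.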

Looking Ahead

Why should readers care about THRD? It addresses a critical gap in existing LLM defenses: the failure to account for how a conversation evolves. As LLMs become more integrated into sensitive applications, robust security frameworks like THRD are essential.

The question remains: will THRD set the standard for future defense mechanisms, or is it just the beginning of a broader shift towards trajectory-dependent analysis in AI security?