RASA: The Future of Safe Mixture-of-Experts Models
RASA aims to tackle safety alignment in Mixture-of-Experts models by focusing on repairing Safety-Critical Experts, not entire models. This targeted approach shows promise in maintaining model integrity and performance.
Mixture-of-Experts (MoE) models pose unique challenges for safety alignment. Their sparse routing mechanisms complicate traditional fine-tuning methods, and researchers have observed that MoE models can exhibit degenerate optimization behaviors under standard fine-tuning. Herein lies the problem: how do we ensure safety without compromising the model's capabilities?
Why RASA Matters
Enter RASA, a routing-aware expert-level alignment framework, which represents a significant stride forward. Instead of adjusting all parameters indiscriminately, RASA zeroes in on the experts that are disproportionately activated during successful jailbreaks. By selectively fine-tuning these Safety-Critical Experts while keeping the routing fixed, the framework not only repairs the model but also fortifies it against routing-based bypasses.
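The selection step above can be sketched in a few lines. This is a minimal, illustrative sketch, not the paper's implementation: the function names, the ratio-based selection rule, and the threshold value are all assumptions standing in for whatever criterion RASA actually uses to flag disproportionately activated experts.

```python
from collections import Counter

def select_safety_critical_experts(jailbreak_routes, benign_routes,
                                   ratio_threshold=2.0):
    """Flag experts activated disproportionately often on jailbreak prompts.

    jailbreak_routes / benign_routes: lists of expert indices chosen by the
    router per token (illustrative stand-ins for real routing traces).
    """
    jb, bn = Counter(jailbreak_routes), Counter(benign_routes)
    n_jb, n_bn = max(len(jailbreak_routes), 1), max(len(benign_routes), 1)
    critical = []
    for expert, count in jb.items():
        jb_rate = count / n_jb
        bn_rate = bn.get(expert, 0) / n_bn
        # Hypothetical rule: an expert is "safety-critical" if its jailbreak
        # activation rate exceeds its benign rate by the threshold factor.
        if jb_rate >= ratio_threshold * max(bn_rate, 1e-9):
            critical.append(expert)
    return sorted(critical)

def trainable_parameter_mask(num_experts, critical_experts):
    # Under fixed routing, only the safety-critical experts are updated;
    # the router and all other experts stay frozen.
    critical = set(critical_experts)
    return {f"expert_{i}": (i in critical) for i in range(num_experts)}
```

For example, if expert 0 dominates routing on jailbreak prompts but is rarely chosen on benign ones, `select_safety_critical_experts` returns it, and `trainable_parameter_mask` marks it as the only trainable component during the repair step.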
Crucially, RASA's approach maintains routing consistency with safety-aligned contexts. This is a breakthrough. Why? Because it circumvents the pitfalls of global parameter updates, which can inadvertently weaken the model's robustness.
Robustness and Performance
Results from testing RASA across two MoE architectures and various jailbreak attacks are compelling. The framework achieves near-perfect robustness and strong cross-attack generalization. It also significantly reduces over-refusal rates, a persistent concern with prior alignment methods.
RASA preserves the general capabilities of models on benchmarks like MMLU, GSM8K, and TruthfulQA. This balance of safety and performance is rare, offering a practical alternative that doesn't demand architectural compromises.
A Practical Approach
The paper's key contribution lies in its practical and architecture-preserving solution. By targeting only the components that pose safety risks, RASA avoids unnecessary model-wide changes. This specificity stands in stark contrast to traditional methods that often overreach, leading to reduced model efficacy.
In a world where AI safety is key, RASA's targeted approach provides a template for future advancements. Could this signify a shift in how we approach model safety? Perhaps. What's clear is that focusing on expert-level repairs rather than wholesale adjustments can lead to more secure and effective models.
Key Terms Explained
AI Safety: The broad field studying how to build AI systems that are safe, reliable, and beneficial.
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
Jailbreak: A technique for bypassing an AI model's safety restrictions and guardrails.
MMLU: Massive Multitask Language Understanding, a benchmark for evaluating a model's general knowledge and reasoning.