Rescuing Reasoning in Distilled Language Models
Distilled language models often sacrifice reasoning ability for efficiency. A new method, RED, promises to reverse this trade-off without losing performance.
Efficiently distilling large language models is a balancing act. You want to keep the performance high while shrinking the size. EDistill, a well-known method, does this by pruning parameters and tweaking lightweight modules. But there's a catch. These distilled models show a significant drop in multi-step reasoning, a phenomenon researchers call 'reasoning collapse'.
The Problem with EDistill
Here's what the benchmarks actually show: EDistill models perform admirably on general ability tests relative to their size. However, their ability to handle complex reasoning tasks takes a hit. The issue is linked to something called eRank collapse, where the effective rank of hidden representations declines, making tokens hard to tell apart.
What's causing this? It comes down to how the projection matrices, specifically those that reduce model width, are initialized. When the singular values of these matrices are unevenly distributed, a few directions dominate the projection and the models lose some of their reasoning heft.
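To make the link between uneven singular values and a low effective rank concrete, here is a minimal sketch. It assumes the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values); the article does not say which definition the researchers use, and the function name and example spectra are hypothetical.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank (an assumed definition, not taken from the source)."""
    total = sum(singular_values)
    if total <= 0:
        raise ValueError("need at least one positive singular value")
    # Normalize the spectrum into a probability distribution, skipping zeros.
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# Hypothetical spectra for illustration only.
flat = [1.0, 1.0, 1.0, 1.0]       # singular values spread evenly
skewed = [100.0, 1.0, 1.0, 1.0]   # one direction dominates

flat_erank = effective_rank(flat)
skewed_erank = effective_rank(skewed)
print(f"flat spectrum:   {flat_erank:.2f}")
print(f"skewed spectrum: {skewed_erank:.2f}")
```

Under this definition, the flat spectrum uses all four directions, while the skewed one behaves almost like a rank-one map even though its nominal rank is still four.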
Introducing RED: A New Hope
To tackle this, researchers propose a new approach: Reasoning-preserved Efficient Distillation (RED). It uses activation-aware initialization to set up projection matrices as channel-selection matrices. In theory, this keeps the reasoning intact by avoiding eRank collapse.
Experiments on popular Llama and Qwen models back this up. RED recovers reasoning capabilities while retaining the high training efficiency and strong general ability that EDistill is known for.
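The channel-selection idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not RED's actual procedure: scoring channels by mean absolute activation is a guess at what "activation-aware" might mean, and the function names and toy activations are hypothetical.

```python
def channel_selection_matrix(activations, k):
    """Build a d x k one-hot projection that keeps the k highest-scoring channels.

    activations: list of samples, each a list of d channel values.
    Assumption: channels are scored by mean absolute activation.
    """
    d = len(activations[0])
    n = len(activations)
    scores = [sum(abs(row[j]) for row in activations) / n for j in range(d)]
    top = sorted(range(d), key=lambda j: scores[j], reverse=True)[:k]
    keep = sorted(top)  # preserve the original channel order
    # Column c has a single 1 in the row of the c-th kept channel.
    matrix = [[1.0 if keep[c] == r else 0.0 for c in range(k)] for r in range(d)]
    return matrix, keep


def project(hidden, matrix):
    """Compute the row-vector product hidden @ matrix."""
    k = len(matrix[0])
    return [sum(hidden[r] * matrix[r][c] for r in range(len(matrix))) for c in range(k)]


def gram(matrix):
    """Return P^T P for a d x k matrix stored as a list of rows."""
    d, k = len(matrix), len(matrix[0])
    return [[sum(matrix[r][a] * matrix[r][b] for r in range(d)) for b in range(k)]
            for a in range(k)]


# Toy calibration activations: 3 samples, 4 channels (made up for illustration).
acts = [
    [0.1, 2.0, -0.2, 1.5],
    [-0.1, -1.8, 0.3, 1.7],
    [0.0, 2.2, -0.1, -1.6],
]
P, kept = channel_selection_matrix(acts, k=2)
reduced = project([0.5, -1.0, 0.25, 3.0], P)
print("kept channels:", kept)
print("reduced vector:", reduced)
print("P^T P:", gram(P))
```

Because each column of such a matrix is a distinct one-hot vector, P^T P is the identity and every singular value equals 1. That flat spectrum is the opposite of the uneven distribution blamed for eRank collapse.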
Why It Matters
Strip away the marketing and you get a simple question: Should we sacrifice reasoning for efficiency? The reality is, as AI permeates fields requiring logical deduction, like law and medicine, reasoning abilities become indispensable. A model that can't string together multiple logical steps won't cut it.
So, is RED the big deal we need? The reported results hint that a sweet spot is possible: efficient models that don't compromise on reasoning. If RED delivers on its promise, it could set a new standard for distilled models.
Key Terms Explained
Distillation: A technique where a smaller 'student' model learns to mimic a larger 'teacher' model.
Llama: Meta's family of open-weight large language models.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.
Training: The process of teaching an AI model by exposing it to data and adjusting its parameters to minimize errors.