Why Multilingual Model Merging Falls Short
Merging fine-tuned models without original data is tempting but often fails in multilingual contexts. Let's unpack why neurons resist merging magic.
Weight-space model merging might sound like a dream: combine independently fine-tuned models without the hassle of original training data. Yet in multilingual machine translation, this approach hits a wall. So what gives?
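To make the setup concrete, here is a minimal sketch of the simplest weight-space merging strategy, uniform (or weighted) averaging of checkpoint parameters. The dictionary-of-lists representation and the `merge_weights` helper are illustrative assumptions, not the paper's actual implementation, which would operate on real model state dicts.

```python
def merge_weights(models, coeffs=None):
    """Average parameter dictionaries element-wise -- the simplest
    weight-space merging strategy (a toy sketch, not the paper's code).

    models: list of dicts mapping parameter name -> flat list of floats.
    coeffs: optional per-model weights; defaults to a uniform average.
    """
    if coeffs is None:
        coeffs = [1.0 / len(models)] * len(models)
    merged = {}
    for name in models[0]:
        merged[name] = [
            sum(c * m[name][i] for c, m in zip(coeffs, models))
            for i in range(len(models[0][name]))
        ]
    return merged

# Toy example: two hypothetical bilingual fine-tunes of the same base model.
m_de = {"layer0.w": [1.0, 2.0]}
m_fr = {"layer0.w": [3.0, 6.0]}
print(merge_weights([m_de, m_fr]))  # {'layer0.w': [2.0, 4.0]}
```

The key assumption this makes, and the one the article argues breaks down, is that the two fine-tunes stay close enough in weight space that their average still lies in a good region for both tasks.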
The Multilingual Misstep
Fine-tuning language models on bilingual corpora and then merging them with standard strategies might work in multitask environments. But in multilingual translation, the outcome is less than stellar. Our experiments show that merging degrades performance significantly when target languages diverge. Which raises the question: why are multilingual contexts so resistant?
Neurons Aren't Cooperating
To crack this puzzle, we dove into the neural networks' internal representations. Using span-conditioned neuron selectivity and layer-wise centered kernel alignment, we found something intriguing. Neurons specific to each language tend to cluster in the embedding layers and upper transformer blocks. Intermediate layers, however, stay largely shared. The rub? Fine-tuning redistributes language selectivity.
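Layer-wise centered kernel alignment compares how similarly two models represent the same inputs at each layer. The sketch below implements standard linear CKA in pure Python for small activation matrices; the matrix layout and helper names are my own, and a real analysis would run this on per-layer activations from the actual fine-tuned checkpoints.

```python
import math

def center(X):
    # Column-center a matrix given as a list of rows (n samples x d features).
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in X]

def frob_sq_of_product(A, B):
    # ||A^T B||_F^2 for row-major matrices A (n x p) and B (n x q).
    total = 0.0
    for i in range(len(A[0])):
        for j in range(len(B[0])):
            dot = sum(A[r][i] * B[r][j] for r in range(len(A)))
            total += dot * dot
    return total

def linear_cka(X, Y):
    """Linear CKA between two activation matrices over the same n inputs:
    ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    Xc, Yc = center(X), center(Y)
    num = frob_sq_of_product(Xc, Yc)
    den = math.sqrt(frob_sq_of_product(Xc, Xc) * frob_sq_of_product(Yc, Yc))
    return num / den
```

CKA is invariant to orthogonal transforms and isotropic scaling of either representation, which is why it is a common choice for comparing layers across independently trained models: high CKA in intermediate layers and low CKA in upper layers is exactly the "shared middle, divergent top" pattern described above.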
Instead of homing in on language-specific neurons, fine-tuning makes them less exclusive for supervised and related languages. Meanwhile, neurons for unsupervised languages grow more isolated. This redistribution amplifies divergence in higher layers, precisely where generation is governed. Not exactly the recipe for merging success.
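One way to picture this redistribution: give each neuron a selectivity score per language and watch how the scores shift after fine-tuning. The contrast-style metric below is a hypothetical illustration of the idea behind span-conditioned selectivity, not the paper's exact definition.

```python
def selectivity(acts_by_lang, lang):
    """Toy selectivity score in [-1, 1]: how much more a neuron activates
    on spans of `lang` than on spans of all other languages.
    (Hypothetical contrast metric, not the paper's definition.)"""
    in_acts = acts_by_lang[lang]
    mu_in = sum(in_acts) / len(in_acts)
    others = [a for l, v in acts_by_lang.items() if l != lang for a in v]
    mu_out = sum(others) / len(others)
    return (mu_in - mu_out) / (abs(mu_in) + abs(mu_out) + 1e-9)

# A neuron that fires only on German spans is highly German-selective.
acts = {"de": [1.0, 0.9], "fr": [0.0, 0.1]}
```

Under this lens, the finding is that fine-tuning pushes scores for supervised and related languages toward zero (less exclusive) while scores for unsupervised languages move toward the extremes, and it does so most strongly in the upper layers.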
Geometry of Fine-Tuning
What does this mean for weight-space merging? The geometry of fine-tuning reshapes language model architecture in ways that undermine compatibility with merging assumptions. The shared layers aren’t doing the heavy lifting you’d expect. So, if you're banking on merging to solve your multilingual model woes, you might want to reconsider.
Slapping a merged checkpoint onto a rented GPU isn't a convergence thesis. It's time we accept that multilingual fine-tuning requires more nuanced strategies than the blunt instrument of merging. Show me the inference costs. Then we'll talk about innovation.
Key Terms Explained
Embedding: A dense numerical representation of data (words, images, etc.).
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
GPU: Graphics Processing Unit.
Inference: Running a trained model to make predictions on new data.