Fine-Tuning Whisper for Pashto: A Layered Approach
Whisper's pre-trained models struggle with Pashto, but fine-tuning strategies are showing promise. The challenge? Driving down word error rates in this underrepresented language.
In the AI-driven world of language processing, Pashto often gets the short end of the stick. Whisper's pre-trained models have consistently failed to deliver accuracy, transcribing Pashto audio into Arabic, Dari, or Urdu script instead. This has left users with word error rates spiraling above 100%, a figure that's as unhelpful as it sounds.
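A WER above 100% looks paradoxical but is arithmetically routine: the metric divides substitutions, deletions, and insertions by the reference length, so a model that hallucinates extra words can rack up more errors than there are reference words. A minimal pure-Python sketch (the `wer` helper below is illustrative, not the project's evaluation script):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Computed via word-level Levenshtein distance. Because insertions count
    as errors, WER exceeds 100% when the model emits more wrong words than
    the reference contains -- e.g. Whisper producing Urdu text for Pashto audio.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# 2 substitutions + 2 insertions against a 2-word reference:
print(wer("one two", "a b c d"))  # → 2.0, i.e. 200% WER
```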
Challenges in Fine-Tuning
Four fine-tuning strategies were thrown into the ring to tackle this issue: vanilla full fine-tuning, LoRA, frozen-encoder, and a multistage Urdu-to-Pashto transfer. Each brought its own set of challenges and outcomes. Vanilla fine-tuning stole the spotlight with a word error rate (WER) of 21.22% on CommonVoice Pashto v20, outshining LoRA by 33.36 percentage points, frozen-encoder by 14.76 points, and Urdu transfer by a hefty 44.56 points.
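LoRA's gap to full fine-tuning is easier to appreciate once you count what it actually trains: low-rank adapters update each weight matrix W as W + BA, touching only a small fraction of the parameters. A back-of-the-envelope sketch, with shapes chosen for illustration rather than taken from whisper-base's real configuration:

```python
def lora_trainable_params(d_model: int, rank: int, n_matrices: int) -> int:
    """Parameters trained when each d_model x d_model weight matrix W gets
    a rank-`rank` update W + B @ A (A: rank x d_model, B: d_model x rank)."""
    return n_matrices * 2 * d_model * rank

# Hypothetical shapes for illustration only:
full = 32 * 512 * 512                      # 32 square projections, fully trained
lora = lora_trainable_params(512, 8, 32)   # same projections, rank-8 adapters
print(f"LoRA trains {lora / full:.1%} of those weights")  # → 3.1%
```

With so little trainable capacity, adapters can fall short when the target language sits far outside the pretraining distribution, which is consistent with LoRA trailing full fine-tuning here.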
However, not all strategies proved fruitful. The frozen-encoder method, which kept whisper-base's encoder fixed on the assumption that its acoustic representations would transfer across languages, took a performance hit. Why? Freezing those layers removed a significant chunk of the model's trainable capacity, an oversight that's hard to forgive in such a competitive space.
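The capacity loss is mechanical: freezing simply turns off gradient updates for every encoder parameter. A sketch of the pattern, assuming a Hugging Face-style model whose encoder lives at `model.model.encoder` (the attribute path is an assumption; adjust it for other layouts):

```python
def freeze_encoder(model) -> None:
    """Disable gradients for all encoder parameters, leaving only the
    decoder (and any heads) trainable. Assumes the encoder is reachable
    at `model.model.encoder`, as in Hugging Face Whisper checkpoints."""
    for p in model.model.encoder.parameters():
        p.requires_grad = False

def trainable_fraction(model) -> float:
    """Share of parameters that will still receive gradient updates."""
    total = trainable = 0
    for p in model.parameters():
        n = p.numel()
        total += n
        if p.requires_grad:
            trainable += n
    return trainable / total
```

Running `trainable_fraction` before and after `freeze_encoder` makes the trade-off concrete: whatever fraction of the network sits in the encoder is simply gone from the optimization.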
The Pashto Conundrum
On CommonVoice Pashto v24, whisper-small reached a WER of 24.89%, only narrowly beating whisper-base despite having roughly three times its parameters. Meanwhile, whisper-large-v3-turbo recorded a WER of 23.37%, marking diminishing returns as the parameter count climbed further. The practical optimum seems to rest with whisper-small trained on 113 hours of audio data.
Online augmentation provided an important 7.25-percentage-point reduction in WER, highlighting the value of diverse and extensive training data. Yet even with these advances, error analysis pinpoints recurring issues: confusion in word-final suffixes, along with substitutions of Pashto's retroflex consonants and its distinctive affricate /ts/, remain stubborn hurdles.
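Online augmentation means each example is perturbed freshly every time it is drawn, rather than once offline, so the model never sees the exact same waveform twice. A toy sketch of the idea on a raw waveform; the perturbation ranges below are illustrative assumptions, not the study's actual recipe:

```python
import random

def augment(samples, rng=random):
    """Online augmentation sketch: random speed perturbation plus additive
    Gaussian noise, applied on the fly to a waveform (list of floats).

    The naive index-scaling resample is for illustration only; a real
    pipeline would use a proper resampler (e.g. torchaudio's)."""
    speed = rng.uniform(0.9, 1.1)          # assumed range, not the paper's
    n = int(len(samples) / speed)
    warped = [samples[min(int(i * speed), len(samples) - 1)] for i in range(n)]
    return [x + rng.gauss(0.0, 0.005) for x in warped]
```

Because the perturbation is sampled per draw, 113 hours of audio effectively becomes an unbounded stream of variants, which is the usual explanation for gains of this kind.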
What's Next?
So, where do we go from here? The release of fine-tuned checkpoints and evaluation scripts on HuggingFace is a step in the right direction. But the field needs more than software tweaks: building language models that truly understand Pashto will likely take deeper collaboration and richer data sets.
Ultimately, underserved languages need dedicated investment in data and compute. As the world grows more interconnected, ensuring every language has a seat at the table isn't just fair, it's essential for progress.
Key Terms Explained
Compute: The processing power needed to train and run AI models.
Encoder: The part of a neural network that processes input data into an internal representation.
Evaluation: The process of measuring how well an AI model performs on its intended task.
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.