FLEURS-Kobani: A Groundbreaking Dataset for Northern Kurdish

Northern Kurdish speakers have faced a persistent challenge in automatic speech recognition and translation. Without a robust dataset, their linguistic needs were often left unmet. Enter FLEURS-Kobani, a pioneering dataset that finally extends the FLEURS benchmark to include Northern Kurdish, identified by the ISO 639-3 code KMR.

Breaking New Ground

FLEURS-Kobani isn't just another dataset. It comprises 5,162 validated utterances, encompassing 18 hours and 24 minutes of recordings from 31 native speakers. This isn't merely about numbers. It's about setting a precedent for linguistic inclusivity in AI. For too long, under-resourced languages like Northern Kurdish have been sidelined in AI development. This dataset symbolizes a shift towards acknowledging and addressing these gaps.
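To put those totals in perspective, here is a rough back-of-the-envelope sketch in Python that turns them into averages. The function and key names are illustrative, and the averages are derived here from the reported totals rather than taken from the dataset's documentation; the real distribution across speakers may well be uneven.

```python
def dataset_averages(hours, minutes, utterances, speakers):
    """Derive simple averages from the reported dataset totals."""
    total_seconds = hours * 3600 + minutes * 60
    return {
        "total_seconds": total_seconds,
        # Mean length of one utterance, in seconds.
        "seconds_per_utterance": total_seconds / utterances,
        # Mean number of utterances per speaker (assumes nothing about the
        # real split, which is not given in the text).
        "utterances_per_speaker": utterances / speakers,
        # Mean recording time per speaker, in minutes.
        "minutes_per_speaker": total_seconds / 60 / speakers,
    }


if __name__ == "__main__":
    # Totals as reported above: 18 h 24 min, 5,162 utterances, 31 speakers.
    stats = dataset_averages(hours=18, minutes=24, utterances=5162, speakers=31)
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

On these figures, a typical utterance runs to roughly 13 seconds, which is in line with read-sentence recordings.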

Performance and Results

The team didn't stop at providing data. Baseline results from fine-tuning the Whisper large-v3 model for Automatic Speech Recognition (ASR) and End-to-End Speech-to-Text Translation (E2E S2TT) reveal promising outcomes. A two-stage fine-tuning strategy, training first on Common Voice and then on FLEURS-Kobani, produced the best ASR performance, with a Word Error Rate (WER) of 28.11% and a Character Error Rate (CER) of 9.84% on the test set (lower is better for both). For the E2E S2TT task, translating KMR to English, Whisper achieved a BLEU score of 8.68. But let's apply the standard the industry set for itself: these results, while promising, leave clear room for improvement, particularly on translation.
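For readers less familiar with the two ASR metrics quoted above, here is a minimal Python sketch of how WER and CER are conventionally defined: the edit distance between the reference transcript and the model's output, divided by the reference length, counted in words or in characters. The function names and the toy English strings are illustrative assumptions, not the team's evaluation code or data. Real evaluations also normalize text first (casing, punctuation, whitespace), and how that was done here is not described.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference token missing
                curr[j - 1] + 1,     # insertion: extra hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference
    length. Spaces count as characters in this sketch."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Toy example, not taken from the dataset.
    ref = "the cat sat on the mat"
    hyp = "the cat sat on mat"
    print(f"WER: {wer(ref, hyp):.2%}")
    print(f"CER: {cer(ref, hyp):.2%}")
```

Because a single wrong character makes the whole word count as an error, WER usually sits well above CER, which fits the gap between the 28.11% and 9.84% figures reported here.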

A Step Forward, But Not the End

FLEURS-Kobani is publicly available under a CC BY 4.0 license, inviting researchers worldwide to further explore and enhance AI's capabilities in understanding Northern Kurdish. However, the burden of proof sits with the team, not the community. Will this dataset evolve into a benchmark that truly reflects the linguistic intricacies of Northern Kurdish? That's a question only time, and further research, will answer.

The industry often trumpets its global ambitions, yet too frequently neglects the very diversity it claims to champion. FLEURS-Kobani challenges this narrative, serving as a reminder that true progress in AI requires embracing every language, not just the widely spoken ones. In a world where technology shapes communication, ensuring all voices are heard isn't just desirable; it's essential.
