Rethinking TTS Evaluation: A New Framework for Low-Resource Languages
The new INSV-A framework aims to provide a clearer evaluation of text-to-speech systems for languages like Pashto. By separating intelligibility, naturalness, and script fidelity, it delivers a more nuanced view of system performance.
Text-to-speech (TTS) technology faces unique challenges when dealing with low-resource languages, particularly those that use non-Latin scripts. Traditional evaluation often relies on a single ASR round-trip word error rate (WER): the synthesized audio is transcribed by a speech recognizer and the transcript is compared with the input text. That one number often falls short of capturing the full picture. Enter INSV, a new framework that dissects TTS performance into intelligibility, naturalness, script fidelity, and verification.
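To make the round-trip metric concrete, here is a minimal sketch of WER and its character-level counterpart, CER, computed with a plain edit distance. The function names are illustrative and not taken from the INSV release, and the sketch assumes whitespace tokenization with no text normalization, which a real Pashto pipeline would likely need.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

In a round trip, `reference` would be the text fed to the TTS system and `hypothesis` the ASR transcript of the synthesized audio. Both rates can exceed 1.0 when the transcript contains many insertions.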
Breaking Down INSV
INSV doesn't just lump all issues together. Instead, it separates them into distinct categories. The latest iteration, INSV-A, focuses on automated screening. It evaluates synthesis completion, ASR WER and character error rate (CER), transcript script fidelity, and audio language identification. Notably, it skips subjective measures like Mean Opinion Scores (MOS) from native listeners and phonetic annotations, sticking to what machines can measure.
What does this mean for TTS developers? More detailed insight into where their systems excel and where they falter. For example, PashtoTTS-Bench uses INSV-A to assess several TTS systems, including Edge GulNawaz, Edge Latifa, and OmniVoice, on datasets like FLEURS and Common Voice 24. Reporting each dimension separately makes it easier to see how these systems perform in low-resource settings.
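One way to picture a script fidelity check is as the share of letters in an ASR transcript that belong to the expected script. The sketch below is an assumption about how such a check could work, not the definition INSV-A uses: it counts alphabetic characters that fall inside the Unicode Arabic-script blocks, which cover the Pashto alphabet. The name `script_fidelity` is hypothetical.

```python
# Unicode blocks for Arabic-script characters (Pashto letters fall in these).
ARABIC_RANGES = (
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
)


def is_arabic_script(ch):
    """True if the character's code point lies in an Arabic-script block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ARABIC_RANGES)


def script_fidelity(transcript):
    """Fraction of alphabetic characters written in Arabic script.

    Hypothetical metric for illustration; digits, punctuation, spaces,
    and combining marks are ignored. Returns 0.0 if there are no letters.
    """
    letters = [ch for ch in transcript if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_arabic_script(ch) for ch in letters) / len(letters)
```

A transcript that comes back romanized or in another script would score near zero on this check even when its WER against a normalized reference looks acceptable, which is the kind of failure a separate script dimension is meant to expose.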
Competitive Performance
During the April-May 2026 evaluations, OmniVoice auto emerged as the leader, achieving the lowest WER: 24.1% on FLEURS and 27.4% on Common Voice 24 (CV24). Surprisingly, these figures were lower than the natural speech baselines, highlighting the potential of synthetic audio. Edge GulNawaz followed with 32.8% on FLEURS and 39.5% on CV24, showing decent performance but with room for improvement.
The question is, can these synthetic voices truly match the nuances of native speech? While the WER figures are promising, they don't tell the whole story. Whisper Large V3, for instance, failed to label the audio as Pashto, highlighting a significant gap in language identification capabilities.
Why It Matters
So why should we care about these technical evaluations? Because more and more communication runs between AI systems, and it needs to be smooth and effective. If machines can't accurately synthesize and understand each other's languages, we're setting ourselves up for a future where language barriers aren't just a human problem.
The release also offers rich metadata, including provider information, per-sentence scores, and language identification audits. This level of transparency and detail equips researchers and developers with the tools they need to push TTS technology forward, and accurate, nuanced evaluation frameworks like INSV-A are vital steps in that direction.