PolySpeech-100: Redefining Speech-LLM Benchmarks

Speech-Large Language Models (Speech-LLMs) are evolving faster than you can say 'semantic reasoning,' but how we evaluate them hasn't kept pace. Older benchmarks focus too heavily on transcription and overlook the rich variety of regional dialects and low-resource languages. Enter PolySpeech-100, a major shift in how Speech-LLMs are evaluated.

Why We Need a New Benchmark

Think of it this way: if you're only evaluating models on how well they transcribe high-resource languages, you're missing a vast linguistic landscape. PolySpeech-100 shifts the focus to 'native-level' speech comprehension, covering 110 linguistic variants. This isn't just a nod to diversity; it's a necessary step toward more inclusive AI systems.

PolySpeech-100 goes beyond traditional benchmarks with a hybrid construction pipeline. It combines human recordings with synthetic speech, allowing it to capture nuances across 19 Chinese dialects and over 80 low-resource languages. If you've ever trained a model, you know how hard low-resource languages are to get right. This benchmark offers a consistent way to measure how well models handle them.

Key Findings: E2E Models and the Modality Gap

Here's the thing: an evaluation of 22 state-of-the-art models, including Gemini-3 and GPT-Audio, yields a key insight. Open-source end-to-end (E2E) models outperform cascade systems when handling heavy dialects. This suggests that direct audio processing captures essential paralinguistic cues and prosodic features that are often lost in transcription.
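To make the cascade-versus-E2E distinction concrete, here is a minimal sketch. Every name in it is hypothetical and none of it comes from the PolySpeech-100 code. It assumes a cascade system means a speech recognizer feeding a text-only LLM, and it fakes "audio" as bytes of the form `words|tone` so the toy can show a cue that exists in the signal but not in the transcript.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in type: real systems take waveforms, not tagged bytes.
Audio = bytes


@dataclass
class CascadeSystem:
    """Speech recognizer followed by a text-only LLM (assumed definition)."""
    asr: Callable[[Audio], str]        # audio -> transcript
    llm: Callable[[str, str], str]     # (transcript, question) -> answer

    def answer(self, audio: Audio, question: str) -> str:
        transcript = self.asr(audio)   # anything not in the text is dropped here
        return self.llm(transcript, question)


@dataclass
class EndToEndSystem:
    """A single model that reads the audio directly."""
    speech_llm: Callable[[Audio, str], str]   # (audio, question) -> answer

    def answer(self, audio: Audio, question: str) -> str:
        return self.speech_llm(audio, question)


# --- Toy components, invented purely for illustration ---
def toy_asr(audio: Audio) -> str:
    words, _tone = audio.decode("utf-8").split("|")
    return words.lower()               # the tone tag never reaches the LLM


def toy_text_llm(transcript: str, question: str) -> str:
    return "positive" if "great" in transcript else "negative"


def toy_speech_llm(audio: Audio, question: str) -> str:
    words, tone = audio.decode("utf-8").split("|")
    if tone == "sarcastic":            # cue available only from the audio
        return "negative"
    return toy_text_llm(words.lower(), question)


clip = b"GREAT JOB|sarcastic"
question = "Is the speaker pleased?"

cascade_answer = CascadeSystem(toy_asr, toy_text_llm).answer(clip, question)
e2e_answer = EndToEndSystem(toy_speech_llm).answer(clip, question)
print(cascade_answer, e2e_answer)
```

The toy is a caricature, but it shows the shape of the argument: once the recognizer emits text, the downstream LLM can only work with what the transcript preserved.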

But it's not all rosy. The results also reveal a significant gap. While commercial models maintain their robustness, open-source ones fall short on low-resource languages, showing catastrophic performance degradation. It's a call to action for the open-source community to invest more resources into these languages.

Chain-of-Thought Prompting: A Double-Edged Sword?

In a surprising twist, Chain-of-Thought prompting, often lauded for its effectiveness, actually degrades speech understanding in zero-shot settings for most of the evaluated models. So, what gives? The result points to a potential modality alignment gap in current architectures. Could this be a sign that our current approaches need a rethink?
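If you want to check this effect on your own model, one simple approach is to score the same questions twice, once with a direct prompt and once with a Chain-of-Thought prompt, and compare accuracy. The sketch below assumes exact-match scoring on short answers; the function names and the answer lists are made up for illustration and are not the benchmark's own scoring code or results.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Exact-match accuracy, ignoring case and surrounding whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


def cot_delta(
    direct_preds: Sequence[str],
    cot_preds: Sequence[str],
    references: Sequence[str],
) -> float:
    """Accuracy change from adding Chain-of-Thought; negative means CoT hurt."""
    return accuracy(cot_preds, references) - accuracy(direct_preds, references)


# Made-up answers to four multiple-choice questions (not real benchmark data).
references = ["B", "A", "C", "D"]
direct_preds = ["B", "A", "C", "A"]   # 3 of 4 correct
cot_preds = ["B", "C", "C", "A"]      # 2 of 4 correct

delta = cot_delta(direct_preds, cot_preds, references)
print(f"CoT changed accuracy by {delta:+.2f}")  # prints "CoT changed accuracy by -0.25"
```

A negative delta on the same audio and the same questions is the pattern described above. Running the comparison per language or dialect group would show whether the drop is uniform or concentrated in particular variants.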

PolySpeech-100 isn't just another benchmark. It's a bold step toward more inclusive, omni-capable models. By setting a rigorous standard, it challenges the field to look beyond the usual suspects and address the nuances of underrepresented languages. For researchers and developers, this is a call to elevate their game. And honestly, it's about time.

For those interested in diving deeper, the data, demo, and code are publicly available on GitHub. So go on, take a look and see how your favorite models stack up against this new standard.