Language Model Safety: Somali Gets Short-Changed
Recent evaluations reveal significant gaps in language model safety for Somali. English gets more focus, leaving Somali with higher risks.
Large language models are reshaping how we interact with technology, but there's a glaring inconsistency in safety evaluations. While English is often at the forefront, lower-resource languages like Somali remain largely overlooked. A recent study highlights this disparity, examining how four instruction-tuned models fare with harmful-intent prompts in both English and Somali.
Dissecting the Disparity
Let's break this down. Researchers tested four models, Llama-3.1-8B, Gemma-2-9B, Qwen-2.5-7B, and Aya-23-8B, on SomaliBench v0. The benchmark consists of 100 author-verified harmful-intent prompts, each provided in both English and Somali. Every model was run at a fixed temperature with the same prompt, one that asks for outputs that are helpful, harmless, and honest.
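To make the setup concrete, here is a minimal sketch of how such an evaluation grid could be laid out. It assumes the benchmark holds 100 prompt pairs, one English and one Somali version of each. The temperature value, the system prompt wording, and the names `Job` and `build_jobs` are illustrative placeholders, not details from the study.

```python
from dataclasses import dataclass
from itertools import product

# Model names come from the article; everything else below is an assumption.
MODELS = ["Llama-3.1-8B", "Gemma-2-9B", "Qwen-2.5-7B", "Aya-23-8B"]
LANGUAGES = ["en", "so"]  # ISO 639-1 codes for English and Somali

# Assumed values: the article only says the temperature was fixed and the
# prompt was consistent, without giving either.
TEMPERATURE = 0.0
SYSTEM_PROMPT = "You are a helpful, harmless, and honest assistant."


@dataclass(frozen=True)
class Job:
    """One generation request: a model answering one prompt in one language."""
    model: str
    prompt_id: int
    language: str
    temperature: float
    system_prompt: str


def build_jobs(num_prompts=100):
    """Cross every model with every prompt in both languages, same settings."""
    return [
        Job(model, prompt_id, language, TEMPERATURE, SYSTEM_PROMPT)
        for model, prompt_id, language in product(
            MODELS, range(num_prompts), LANGUAGES
        )
    ]


jobs = build_jobs()
print(len(jobs))  # 4 models x 100 prompts x 2 languages
```

Holding the temperature and prompt constant across every job is what lets a difference in refusal behavior be attributed to the language rather than to the decoding settings.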
The results? Stark gaps in refusal rates between English and Somali. Llama led with a 0.90 refusal rate on English prompts, while its refusal rate on the Somali versions lagged behind. Gemma refused just 0.38 of the Somali prompts, which illustrates the chasm.
What's Really Happening?
Strip away the marketing and you get inconsistent safety across languages. For Somali, the models often returned unclear outputs: empty responses, responses in the wrong language, or incoherent text. Some may argue that an unclear output is less harmful than dangerous compliance, but it still undermines the model's reliability.
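As a rough illustration of how these numbers are derived, the sketch below tallies judged responses into three buckets (refuse, comply, unclear) and computes the English-minus-Somali refusal gap. The label names and the sample data are invented for the example and are not results from the study.

```python
from collections import Counter

# Assumed label set; the study's exact categories may differ.
LABELS = ("refuse", "comply", "unclear")


def label_rates(labels):
    """Return the fraction of responses that received each label."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    return {name: counts[name] / len(labels) for name in LABELS}


def refusal_gap(english_labels, somali_labels):
    """English refusal rate minus Somali refusal rate."""
    english_rate = label_rates(english_labels)["refuse"]
    somali_rate = label_rates(somali_labels)["refuse"]
    return english_rate - somali_rate


# Made-up judgments for ten prompts per language, for illustration only.
english = ["refuse"] * 9 + ["comply"] * 1
somali = ["refuse"] * 4 + ["comply"] * 2 + ["unclear"] * 4

print(label_rates(english))
print(label_rates(somali))
print(round(refusal_gap(english, somali), 2))
```

Tracking "unclear" as its own bucket matters here: a low Somali refusal rate could mean the model complied, or it could mean the model produced nothing usable, and those are different failures.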
Notably, verification by the native-speaker author showed perfect agreement with the judge on the sampled responses. This underscores the importance of human oversight, especially where machine outputs are ambiguous.
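Agreement of this kind is typically measured by comparing the judge's label with the human's label on each sampled response. The sketch below shows simple percent agreement. The function name and the sample labels are hypothetical, and the article does not say which agreement statistic the study used.

```python
def percent_agreement(judge_labels, human_labels):
    """Fraction of sampled responses where judge and human gave the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled response")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Made-up labels for four sampled responses, for illustration only.
judge = ["refuse", "unclear", "comply", "refuse"]
human_same = ["refuse", "unclear", "comply", "refuse"]
human_diff = ["refuse", "comply", "comply", "refuse"]

print(percent_agreement(judge, human_same))  # every label matches
print(percent_agreement(judge, human_diff))  # one of four labels differs
```

A score of 1.0 corresponds to the perfect agreement reported above. On a small sample, percent agreement says little about how the judge behaves on the responses that were not checked.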
Why It Matters
Why should this concern you? Consider the global deployment of these models. If they're less safe in Somali, what does that mean for other underrepresented languages? Are we inadvertently creating a digital divide where low-resource languages are second-class citizens in the AI world?
Here's what the benchmarks actually show: we need to prioritize safety in all languages, not just English. It's time for developers to shift focus and ensure equitable AI practices across linguistic lines. In AI, the architecture matters more than the parameter count, and how we build and evaluate these models dictates how inclusive the future of AI will be.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Language model: An AI model that understands and generates human language.
Llama: Meta's family of open-weight large language models.
Parameter: A value the model learns during training, specifically the weights and biases in neural network layers.