Unmasking Bias in Multilingual Translation Metrics
A new dataset, XQ-MEval, challenges the reliability of current multilingual translation metrics, revealing cross-lingual bias. This innovation paves the way for fairer evaluations.
Automatic evaluation metrics are the backbone of multilingual translation systems. Yet the prevailing practice of averaging scores across languages raises questions: evaluations may be biased, with translations of similar quality receiving disparate scores depending on the language. Until now, this issue has largely flown under the radar because no solid benchmark offered parallel-quality instances across languages.
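To see why averaging can mislead, consider a toy example (all numbers invented for illustration): a metric that systematically scores one language pair higher than another, even when humans rate the translations as equally good.

```python
# Hypothetical metric scores for two language pairs whose translations
# are equally good by human judgment. The metric's scale simply differs
# per language, so the per-language gap is pure scoring bias.
raw_scores = {
    "en-de": [0.82, 0.84, 0.83],  # the metric is "generous" here
    "en-zh": [0.61, 0.63, 0.62],  # same human-judged quality, lower scores
}

per_language_means = {lang: sum(s) / len(s) for lang, s in raw_scores.items()}
overall_mean = sum(per_language_means.values()) / len(per_language_means)

print(per_language_means)  # {'en-de': 0.83, 'en-zh': 0.62}
print(overall_mean)        # 0.725 -- one number that hides the cross-lingual gap
```

A system evaluated mostly on en-de would look stronger than an equally good system evaluated mostly on en-zh, purely because of the metric's scale.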
The Birth of XQ-MEval
Enter XQ-MEval, a semi-automatically constructed dataset that aims to redefine translation metric evaluation. Covering nine translation directions, XQ-MEval offers a fresh approach to benchmarking. By injecting MQM-defined errors into pristine translations and filtering them through native speakers, this dataset creates pseudo translations with controllable quality. These are then paired with the source and reference, forming triplets that serve as a more reliable standard for assessing translation metrics.
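The dataset's exact schema isn't spelled out here, but a minimal sketch of what one XQ-MEval-style instance might look like follows, with field names and example content assumed for this article rather than taken from the dataset itself:

```python
from dataclasses import dataclass, field

@dataclass
class EvalTriplet:
    """One illustrative instance: source + reference + a pseudo translation
    whose quality is controlled by injecting MQM-defined errors."""
    source: str                # original source-language sentence
    reference: str             # clean human reference translation
    pseudo_translation: str    # the reference with controlled errors injected
    injected_errors: list[str] = field(default_factory=list)  # MQM categories
    severity: str = "minor"    # intended quality level of the injected errors

# Hypothetical example: a single injected accuracy error of major severity.
example = EvalTriplet(
    source="Der Zug fährt um acht Uhr ab.",
    reference="The train departs at eight o'clock.",
    pseudo_translation="The train departs at nine o'clock.",
    injected_errors=["mistranslation"],
    severity="major",
)
```

Because the error type and severity are chosen up front, instances of comparable quality can be produced in every language, which is exactly what cross-lingual comparison of metrics requires.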
But why does this matter? Because the traditional method of evaluation doesn't survive scrutiny. It fails to account for the inherent inconsistencies between metrics and human judgment. XQ-MEval presents empirical evidence of this cross-lingual scoring bias, turning a spotlight on a problem long ignored.
Revealing the Inconsistencies
Through experiments with nine representative metrics, XQ-MEval reveals a disconcerting truth: averaging raw scores across languages is inconsistent with human judgment. Color me skeptical, but can we keep trusting these biased metrics? The dataset doesn't just highlight the problem; it also proposes a solution. A normalization strategy derived from XQ-MEval aligns score distributions across languages, enhancing the fairness and reliability of multilingual metric evaluations.
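The paper's actual normalization is derived from XQ-MEval itself; as a stand-in, the sketch below applies a plain per-language z-score, one common way to put each language's scores on a shared scale before averaging. The function name and data are assumptions for illustration, not the paper's method.

```python
import statistics

def normalize_per_language(scores_by_lang: dict[str, list[float]]) -> dict[str, list[float]]:
    """Center each language's metric scores at 0 with unit spread,
    so cross-language averages compare like with like."""
    normalized = {}
    for lang, scores in scores_by_lang.items():
        mean = statistics.fmean(scores)
        stdev = statistics.pstdev(scores) or 1.0  # guard against zero variance
        normalized[lang] = [(s - mean) / stdev for s in scores]
    return normalized

raw = {"en-de": [0.82, 0.84, 0.83], "en-zh": [0.61, 0.63, 0.62]}
print(normalize_per_language(raw))  # both pairs now share a common scale
```

After a calibration like this, a high score means "good relative to this language's distribution" rather than "scored by a generous metric", which is the fairness property the averaging practice lacks.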
Why Should We Care?
For those invested in building multilingual systems, the implications are significant. We must question the accuracy of current metrics and consider the impact of biased evaluations on global communication. Left unchecked, these metrics could perpetuate disparities across languages and hinder effective cross-cultural exchange.
In the end, XQ-MEval doesn't merely fill a gap in translation metrics. It calls for a fundamental reevaluation of how we assess multilingual systems. As the global landscape becomes increasingly interconnected, ensuring that our tools are fair and reliable isn't just a technical challenge. It's a necessity.