When AI Plays Judge: Is NLG Evaluation Falling Short?

There's no denying it: Natural Language Generation (NLG) is at the forefront of natural language processing today. But as the field grows, so does the challenge of evaluating its advances. By 2025, automated evaluation has overtaken human assessment in research papers. This shift is pushing us to rethink how we measure progress in NLG.

The Automation Advantage

In a world where speed and scalability matter, LLM-as-a-judge (LaaJ) has become the go-to method for evaluating NLG. It's not hard to see why. An analysis of 14,171 papers from major NLP conferences between 2020 and 2025 shows LaaJ's dominance. But does this automated oversight deliver the insight we need?

LaaJ may be a boon for speeding up evaluations, but the ground reality reveals some cracks. Researchers found that only about 8% of papers paired LaaJ with human validation. When machines lead the charge, are we sacrificing nuanced quality checks for convenience?
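What would human validation look like in practice? One common approach is to check whether the judge ranks outputs the same way human raters do. The sketch below is a minimal illustration, not the method used in the analysis above: it computes a Spearman rank correlation in plain Python, and the `human` and `judge` score lists are made-up numbers chosen only to show the mechanics.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither list is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical scores for five generated texts (illustrative only).
human = [4, 2, 5, 3, 1]   # human ratings
judge = [5, 1, 4, 3, 2]   # LLM-judge ratings for the same texts

rho = spearman(human, judge)
print(f"Spearman correlation between judge and humans: {rho:.2f}")
```

A value near 1 means the judge orders the texts much as the human raters did; a value near 0 means the two barely agree. A real validation study would need far more than five samples and, ideally, several annotators.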

Old Metrics in a New World

Despite the leap towards open-ended text generation, many papers still cling to outdated metrics like BLEU and ROUGE. These were fine in their time, but they don't capture the richness of contemporary NLG. It's like using a yardstick for something that needs a microscope.

So why stick with them? Put simply, old habits die hard. Legacy metrics offer a sense of comfort, but they don't always reflect true quality or intent. The persistence of these metrics means researchers may miss out on a fuller picture of what NLG can achieve.
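To make the limitation concrete, here is a rough sketch of a word-overlap score in the spirit of ROUGE-1. It is a simplified stand-in, not the official BLEU or ROUGE implementation (real tools add proper tokenization, stemming, longer n-grams, and other refinements), and the function name `unigram_f1` and the example sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """F1 over shared lowercase, whitespace-split words (simplified overlap metric)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
near_copy = "the cat sat on the rug"            # changes the meaning slightly
paraphrase = "a feline rested upon a carpet"    # same idea, different words

print(f"near copy:  {unigram_f1(near_copy, reference):.2f}")
print(f"paraphrase: {unigram_f1(paraphrase, reference):.2f}")
```

The near copy scores high because it shares five of six words with the reference, while the paraphrase scores zero despite saying much the same thing. That is the yardstick problem in miniature: surface overlap rewards copying and ignores meaning.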

Rethinking Evaluation

Here's the crux: while LaaJ correlates with broader quality indicators, it struggles with specifics like text fluency. This gap is a problem. If LaaJ can't capture all aspects of quality, then it's not reaching its full potential.

A proposed Evaluation Checklist aims to bridge these gaps, helping researchers choose better metrics and validate their findings more effectively. But will this make a difference? Can checklists drive the change needed, or do we need a more fundamental shift in how we think about evaluation?

Ultimately, the question isn't if LaaJ is here to stay. It's how we can make sure it serves us best. In the rush to automate, the human touch remains vital. As we move forward, balancing these forces will shape the future of NLG evaluation. The stakes are high, and the outcome will affect how we harness the power of language models worldwide.
