Can AI Grade Like a Human? The Numbers Say No
Large language models show promise in grading essays but struggle to match human accuracy, often overgrading short essays while undergrading longer ones.
Large language models (LLMs) have been touted as the future of automated essay grading. But do they really stack up against human graders? The latest findings suggest they don't quite hit the mark.
What the Data Shows
In an evaluation of models from the GPT and Llama families, LLMs were tested zero-shot, without any fine-tuning for grading tasks. The results revealed a significant gap between AI and human scoring: the models tended to assign inflated scores to shorter or less developed essays, while penalizing longer essays for minor errors that human graders might overlook.
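To make that comparison concrete, here is a minimal sketch of how such a zero-shot evaluation can be run: prompt a model for a holistic score, then measure agreement with human scores using quadratic weighted kappa (QWK), the standard agreement metric in automated essay scoring. The prompt wording, model name, score scale, and data are illustrative assumptions, not the study's actual setup.

```python
# Sketch: zero-shot LLM grading vs. human scores, measured with QWK.
import re
from openai import OpenAI  # any chat-completion client would work similarly
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_grade(essay: str, model: str = "gpt-4o-mini") -> int:
    """Ask the model for a single holistic score from 1 to 6 (illustrative scale)."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": (
                "Grade the following essay holistically on a 1-6 scale. "
                "Reply with the number only.\n\n" + essay
            ),
        }],
    )
    match = re.search(r"[1-6]", resp.choices[0].message.content)
    return int(match.group()) if match else 0

# Hypothetical paired data: essay texts and their human-assigned scores.
essays = ["...essay text one...", "...essay text two..."]
human_scores = [4, 2]

llm_scores = [llm_grade(e) for e in essays]

# QWK penalizes large disagreements more heavily than small ones;
# roughly 0.7+ is usually considered acceptable agreement for essay scoring.
qwk = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"Quadratic weighted kappa vs. humans: {qwk:.2f}")
```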
Here's what the benchmarks actually show: LLMs exhibit an internally coherent pattern in their feedback and scoring. Essays that draw more praise in the written feedback tend to receive higher scores, while heavily criticized essays receive lower ones. That internal consistency, however, doesn't translate into alignment with human grading standards.
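One way to check that feedback-score consistency is to score the sentiment of the model's written feedback and correlate it with the grades it assigns. The sketch below uses a crude praise-minus-criticism word count and Spearman correlation; the word lists and records are hypothetical, not the study's method.

```python
# Sketch: does the sentiment of an LLM's feedback track the score it assigns?
import re
from scipy.stats import spearmanr

PRAISE = {"clear", "strong", "well-organized", "compelling", "effective"}
CRITICISM = {"unclear", "weak", "disorganized", "vague", "repetitive"}

def feedback_polarity(feedback: str) -> int:
    """Crude praise-minus-criticism count as a sentiment proxy."""
    words = re.findall(r"[a-z-]+", feedback.lower())
    return sum(w in PRAISE for w in words) - sum(w in CRITICISM for w in words)

# Hypothetical (feedback, score) pairs produced by the same model.
records = [
    ("Strong thesis and clear structure throughout.", 5),
    ("Vague argument, weak evidence, unclear transitions.", 2),
    ("Compelling examples but repetitive phrasing.", 4),
    ("Disorganized paragraphs and unclear claims.", 2),
]

polarity = [feedback_polarity(fb) for fb, _ in records]
scores = [s for _, s in records]

# A high positive correlation indicates internally consistent feedback
# and scoring, even if neither matches human judgments.
rho, p = spearmanr(polarity, scores)
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```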
The Reality of AI Scoring
So why does this matter? For educational institutions considering automated grading, relying on LLMs could mean the difference between fair assessment and skewed results. While AI handles some tasks with precision, grading nuanced human writing isn't yet one of them.
Notably, architecture matters more than parameter count here. LLMs weigh different signals than human raters do, which leaves them misaligned with established grading practices. That misalignment poses a challenge for educators seeking fair and accurate assessment tools.
What’s Next for LLMs in Education?
Despite these shortcomings, LLMs aren't without merit. Their feedback aligns with their scoring, which suggests potential for supporting essay grading rather than replacing human graders. But should educators trust AI with their students' futures? The numbers suggest not yet: until these models can better approximate human evaluation, they remain a supplementary tool at best.
In the fast-evolving world of AI, the question isn't just about capability. It's about trust. Can we trust LLMs to grade as fairly and accurately as human educators? Right now, the answer seems to be no.