Bias in AI: Why Essay Scoring Models Often Miss the Mark
A recent study suggests that while LLMs show promise in holistic essay scoring, they score grammar and conventions noticeably more harshly than human raters, which is concerning.
Large Language Models (LLMs) have been making waves in the field of educational assessment, sparking interest for their potential to simplify grading processes. However, a recent analysis casts doubt on their alignment with human scoring standards, particularly in analytic essay assessments.
The Study's Insights
The paper, published in Japanese, evaluates instruction-tuned LLMs on three open essay-scoring datasets: ASAP 2.0, ELLIPSE, and DREsS. The models achieve moderate to high agreement with human holistic scores, with a Quadratic Weighted Kappa of around 0.6, but that headline figure doesn't mean they perform equally well in all areas.
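For readers unfamiliar with the agreement metric, Quadratic Weighted Kappa penalizes rating disagreements by the square of their distance. A minimal sketch (the ratings below are illustrative, not the study's data):

```python
def quadratic_weighted_kappa(a, b, min_r, max_r):
    """Quadratic weighted kappa between two integer rating lists
    on the scale [min_r, max_r]."""
    n = max_r - min_r + 1
    # Observed confusion matrix of rating pairs
    O = [[0] * n for _ in range(n)]
    for x, y in zip(a, b):
        O[x - min_r][y - min_r] += 1
    total = len(a)
    hist_a = [sum(row) for row in O]
    hist_b = [sum(col) for col in zip(*O)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            w = ((i - j) ** 2) / ((n - 1) ** 2)   # quadratic distance weight
            e = hist_a[i] * hist_b[j] / total      # chance-expected count
            num += w * O[i][j]
            den += w * e
    return 1.0 - num / den

# Illustrative human vs. model scores on a 1-5 rubric
human = [2, 3, 4, 4, 5, 3, 2, 4]
model = [2, 3, 3, 4, 5, 4, 2, 4]
print(round(quadratic_weighted_kappa(human, model, 1, 5), 3))  # → 0.873
```

A value around 0.6, as reported for holistic scoring, is conventionally read as moderate-to-substantial agreement.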
Notably, the models exhibited a pronounced negative bias when assessing Lower-Order Concerns (LOC), such as Grammar and Conventions. This bias suggests that LLMs often grade these traits more harshly than human raters, raising concerns about their fairness and reliability.
Bias and Its Implications
Western coverage has largely overlooked this significant issue: the models systematically underscore specific writing traits that matter in instruction. The benchmark results speak for themselves. If models are consistently biased against certain traits, how can educators trust these tools for fair scoring?
This bias is easy to detect with small validation sets for LOC traits, but Higher-Order Concerns (HOC) like argumentation require larger samples for accurate bias detection. This distinction highlights the need for a nuanced approach to deploying LLMs in educational settings.
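The sample-size intuition here follows from standard power analysis: a large, consistent offset (as with LOC traits) stands out in a handful of essays, while a subtle offset (plausible for HOC traits) needs many more. A rough sketch, using hypothetical effect sizes rather than the study's numbers:

```python
import math

def samples_needed(bias, sd):
    """Approximate essays needed to detect a mean score offset of `bias`
    with a two-sided z-test (alpha = 0.05, power = 0.8), given the
    standard deviation `sd` of model-minus-human score differences."""
    z_alpha = 1.96  # two-sided, alpha = 0.05
    z_beta = 0.84   # power = 0.8
    return math.ceil(((z_alpha + z_beta) * sd / bias) ** 2)

# Hypothetical: a pronounced LOC offset vs. a subtle HOC offset
print(samples_needed(bias=0.5, sd=1.0))  # → 32
print(samples_needed(bias=0.1, sd=1.0))  # → 784
```

The quadratic dependence on the offset is why a few dozen labeled essays can expose a grammar-scoring bias while argumentation traits demand far larger validation sets.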
The Path Forward
What should educators and developers do with these findings? A bias-correction-first strategy is important. Rather than relying on raw zero-shot scores, it's essential to implement systematic score offsets using small human-labeled bias-estimation sets. Such an approach could mitigate bias without the need for extensive fine-tuning, making the deployment of LLMs both practical and equitable.
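The offset idea described above can be sketched very simply: estimate the mean model-minus-human gap on a small calibration set, then subtract it from raw scores. All names and numbers below are hypothetical, assuming a 1-5 rubric:

```python
from statistics import mean

def estimate_offset(model_scores, human_scores):
    """Mean model-minus-human gap on a small human-labeled calibration set."""
    return mean(m - h for m, h in zip(model_scores, human_scores))

def correct(score, offset, lo=1, hi=5):
    """Subtract the estimated offset and clamp to the rubric range."""
    return min(hi, max(lo, score - offset))

# Hypothetical calibration set for a Grammar trait (scale 1-5)
human = [3, 4, 2, 4, 3, 5]
model = [2, 3, 2, 3, 2, 4]

offset = estimate_offset(model, human)  # ≈ -0.83: model scores low
adjusted = [correct(s, offset) for s in model]
```

This is a deliberately crude correction; per-trait or per-prompt offsets, or a regression-based calibration, would be natural refinements, but even the simple version avoids any model fine-tuning.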
The question remains: Is it ethical to integrate these models into educational systems without addressing their inherent biases? Until these issues are resolved, the promise of AI in education might remain just that: a promise.