Unlocking the Black Box: A Fresh Take on Automated Scoring Models
A new framework combining Shapley-value attributions and LLM-generated rationales offers insights into rubric-based scoring models. Here’s why it matters.
In the world of automated scoring models, where assessing complex language performances like classroom transcripts often feels like peering into a black box, a new framework aims to shed light on the inner workings of these systems. The framework merges model-agnostic Shapley-value attributions with rationales generated by large language models (LLMs), and it promises to explain why these models assign the scores they do.
Interpreting the Scores
In automated scoring, accuracy is only half the battle. Understanding why a model assigns a particular score is equally critical, especially in high-stakes environments like education. The proposed framework, tested on the Quality of Feedback dimension using the NCTE corpus, provides a systematic way to compare fine-tuned pretrained language models (PLMs) and prompted LLMs. The findings? Fine-tuned PLMs outperform prompted LLMs in prediction accuracy across 6,000 annotated transcript segments, yet they tend to compress scores toward the middle of the scale. That compression could mask real differences between segments.
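One simple way to make "compression toward the middle" concrete is to compare how spread out a model's predictions are against the spread of the human scores. The sketch below is an illustration only, not the metric used in the study: the function name `compression_ratio` is hypothetical, and the two score lists are made-up numbers, not data from the NCTE corpus.

```python
from statistics import pstdev


def compression_ratio(predicted, human):
    """Spread of predictions relative to spread of human scores.

    A value below 1 suggests predictions are squeezed toward the middle.
    Assumes the human scores are not all identical (nonzero spread).
    """
    return pstdev(predicted) / pstdev(human)


# Made-up rubric scores, for illustration only.
human = [1, 1, 2, 3, 3]
predicted = [2, 2, 2, 2, 3]

ratio = compression_ratio(predicted, human)
print(f"compression ratio: {ratio:.2f}")
```

A ratio well below 1 on a real evaluation set would be one sign that a model rarely commits to the lowest or highest rubric levels, even when its overall accuracy looks strong.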
The Strength of SHAP
Deletion-based tests reveal that SHAP, a popular interpretability method, identifies key sentences that strongly influence model predictions. This is where it gets interesting: removing the sentences SHAP flags produces larger and more coherent shifts in predictions, and those sentences remain influential when tested on different model architectures. LLM-generated rationales, while innovative, struggle to exert consistent influence across models. The deeper question is: When does interpretability trump innovation?
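The idea behind a deletion-based test can be sketched in a few lines: attribute a score to individual sentences, delete the top-ranked ones, and see how far the prediction moves. The example below is a toy under stated assumptions, not the study's pipeline. `toy_score` is a hypothetical stand-in for a real scoring model, the transcript sentences are invented, and the Shapley values are computed exactly by enumerating subsets, which is only feasible for a handful of sentences (the SHAP library approximates this for real models).

```python
from itertools import combinations
from math import factorial


def toy_score(sentences):
    """Hypothetical stand-in scorer: counts sentences with feedback-like cue words."""
    cues = ("because", "try", "good")
    return float(sum(any(c in s.lower() for c in cues) for s in sentences))


def shapley_values(sentences, score_fn):
    """Exact Shapley value of each sentence, by enumerating all subsets."""
    n = len(sentences)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = [sentences[j] for j in subset]
                with_i = [sentences[j] for j in sorted(subset + (i,))]
                values[i] += weight * (score_fn(with_i) - score_fn(without))
    return values


def deletion_shift(sentences, score_fn, drop):
    """Change in score after deleting the sentences whose indices are in `drop`."""
    kept = [s for j, s in enumerate(sentences) if j not in drop]
    return score_fn(sentences) - score_fn(kept)


# Invented transcript segment, for illustration only.
segment = [
    "Open your books.",
    "Good idea, try it with a smaller number.",
    "Sit down please.",
    "That works because the sides are equal.",
]

values = shapley_values(segment, toy_score)
ranked = sorted(range(len(segment)), key=lambda j: values[j], reverse=True)
top = set(ranked[:2])
rest = set(ranked[2:])

print("shift after deleting top-attributed sentences:", deletion_shift(segment, toy_score, top))
print("shift after deleting the other sentences:", deletion_shift(segment, toy_score, rest))
```

A faithful explanation should produce a clearly larger shift when its top-ranked sentences are deleted than when other sentences are. Testing transfer across architectures amounts to computing the attributions with one model and passing a different model's scoring function to `deletion_shift`.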
Why It Matters
At a time when education systems increasingly rely on automated assessments, understanding these models' decision-making processes is important. The framework’s ability to provide more faithful and transferable explanations could transform how educators and policymakers evaluate scoring models. For those who believe that a model’s transparency is as vital as its accuracy, this approach is a big deal. Yet, one must ask: Will educators and tech developers embrace this nuanced understanding, or will they prioritize raw accuracy over interpretability?
Ultimately, the framework offers not just a technical advancement but a philosophical stance on the value of transparency in AI-driven assessments. By choosing SHAP as the cornerstone for explanation, the study underscores the importance of explanations that are both accurate and understandable, urging stakeholders to rethink how they approach automated scoring.