Rethinking LLM Evaluation: A New Approach to Attribution

By Rina Shimizu, May 27, 2026

A new framework challenges the soft-perturbation metrics used to evaluate attribution methods for large language models (LLMs). The proposed metrics aim to separate attribution quality from how many words are retained, offering a more balanced comparison.

Evaluating large language models (LLMs) has always been tricky, especially when it comes to understanding how these models attribute importance to different inputs. Existing metrics like Soft-NC and Soft-NS have tried to tackle this, but there's a catch: they often conflate how good the attribution is with how many words are kept during perturbation.

Breaking Down the Metrics

What's the issue here? Well, if an attribution method keeps more words, it might look like it's doing a better job, even if that's not entirely true. This is where the new proposals, $π$-Soft-NC and $π$-Soft-NS, come into play. They level the playing field by comparing attribution methods when the expected number of words retained is the same. It's a subtle but essential shift.
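To make the "same expected number of words retained" idea concrete, here is a minimal sketch in Python. It is not the paper's definition of $π$-Soft-NC or $π$-Soft-NS, which this article does not spell out. It only illustrates one way to put two attribution methods on an equal footing: rescale each method's scores into per-word keep probabilities whose sum (the expected number of kept words) matches a shared budget. The function name `keep_probabilities`, the `budget` parameter, and the example scores are all hypothetical.

```python
def keep_probabilities(scores, budget):
    """Map attribution scores to per-word keep probabilities.

    Assumption (not from the paper): each word is kept independently
    with probability min(1, c * score), and c is chosen so that the
    probabilities sum to `budget`, i.e. the expected number of kept words.
    """
    s = [max(0.0, float(x)) for x in scores]  # negative scores count as zero
    positive = sum(1 for x in s if x > 0)
    if not 0 < budget <= positive:
        raise ValueError("budget must be in (0, number of positively scored words]")

    def expected_kept(c):
        return sum(min(1.0, c * x) for x in s)

    # Grow the upper bound until the budget is reachable, then bisect on c.
    lo, hi = 0.0, 1.0
    while expected_kept(hi) < budget:
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_kept(mid) < budget:
            lo = mid
        else:
            hi = mid
    return [min(1.0, hi * x) for x in s]


# Two hypothetical attribution methods scoring the same four words.
method_a = [0.9, 0.1, 0.5, 0.2]
method_b = [5.0, 1.0, 1.0, 1.0]

probs_a = keep_probabilities(method_a, budget=2)
probs_b = keep_probabilities(method_b, budget=2)

# Both methods now retain two words in expectation, so any difference
# in a downstream faithfulness score is not due to keeping more words.
print(round(sum(probs_a), 6), round(sum(probs_b), 6))
```

With the budget fixed, a method can no longer look better simply because its raw scores happen to keep more of the input.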

Introducing Grad-ELLM

Now, let's talk about Grad-ELLM. This is a new gradient-based attribution method specifically designed for autoregressive decoder-only LLMs. It blends the importance of channels derived from gradients with the token importance from attention at each step of decoding. In simpler terms, it's like having two sets of eyes checking each decision. The paper, published in Japanese, reveals that this method shines in classification and open-generation tasks using models like Llama and Mistral.
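For readers who want a feel for what "blending" gradient-derived channel importance with attention-derived token importance could look like, here is a rough Python sketch for a single decoding step. It is an assumption-heavy illustration in the spirit of Grad-CAM, not Grad-ELLM's actual formula, which this article does not give. The function `step_attribution`, its inputs, and the toy numbers are hypothetical; in practice the activations, gradients, and attention weights would come from the model itself.

```python
def step_attribution(activations, gradients, attention):
    """Token scores for one decoding step (illustrative only).

    activations[t][c]: hidden activation of token t in channel c
    gradients[t][c]:   gradient of the chosen output logit w.r.t. that activation
    attention[t]:      attention weight from the current step to token t

    Assumptions (not confirmed by the source):
      * channel importance = gradient averaged over tokens (Grad-CAM style)
      * token score = ReLU(sum_c weight_c * activation) * attention weight
    """
    n_tokens = len(activations)
    n_channels = len(activations[0])

    # Channel importance derived from gradients.
    channel_weights = [
        sum(gradients[t][c] for t in range(n_tokens)) / n_tokens
        for c in range(n_channels)
    ]

    scores = []
    for t in range(n_tokens):
        weighted = sum(
            channel_weights[c] * activations[t][c] for c in range(n_channels)
        )
        # Keep only positive evidence, then scale by token importance from attention.
        scores.append(max(0.0, weighted) * attention[t])
    return scores


# Toy example: three tokens, two channels.
activations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gradients = [[0.2, -0.1], [0.4, -0.3], [0.0, 0.1]]
attention = [0.5, 0.3, 0.2]

token_scores = step_attribution(activations, gradients, attention)
print(token_scores)
```

In this toy case the second token ends up with a score of zero, because its only active channel has negative gradient-derived importance, no matter how much attention it receives.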

But here's the kicker: Grad-ELLM shows strong comprehensiveness-oriented faithfulness under the $π$-Soft-NC metric. Yet, strangely, there's no standout method when you use $π$-Soft-NS. This dichotomy raises questions about how we measure success in model attribution. Is it more about the breadth of understanding or depth?

Implications for XAI

These advancements aren't just academic exercises. They have real implications for explainable AI (XAI). With a more rigorous framework, we can better compare different attribution methods and push forward in making AI systems more transparent. But why should we care? Because as AI integrates deeper into critical sectors, understanding its decision-making becomes essential.

What the English-language press missed: these frameworks could redefine how we approach LLM evaluation globally. Compare attribution methods side by side at the same retention budget, and it becomes evident that we're on the cusp of a more nuanced understanding of AI performance. And honestly, isn't that the kind of leap we need right now?
