Unraveling the Truth in Open-Source LLMs: What Lies Beneath?

Recent research probes whether open-source language models encode a discernible truthfulness signal in their hidden states. The results are intriguing: the signal appears to be there, and where it sits in the network depends on the model's architecture.

The Findings

The study analyzed three instruction-tuned models: Llama-3.1-8B, Mistral-7B, and Qwen2.5-7B. Each has between 7 billion and 8 billion parameters and was loaded with 4-bit NF4 quantization. The researchers extracted hidden states from each model and evaluated detectors built on them across four hallucination benchmarks: TruthfulQA, HaluEval-QA, FEVER, and a controlled synthetic set.
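To make the setup concrete, here is a minimal sketch of how hidden states might be pulled from a 4-bit NF4 model with the Hugging Face transformers and bitsandbytes libraries. The article does not say which tooling the study used, so the library choice, the function name extract_hidden_states, the bfloat16 compute dtype, and the use of the last token's vector are all assumptions made for illustration.

```python
def extract_hidden_states(model_id, prompts, layer):
    """Return one last-token hidden vector per prompt from the given layer.

    Hypothetical helper: the pooling choice (last token) and the compute
    dtype are assumptions, not details taken from the study.
    """
    # Imported lazily so this file can be loaded without the GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    quant = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # the NF4 format named in the article
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: not stated in the article
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=quant, device_map="auto"
    )
    model.eval()

    vectors = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # hidden_states[0] is the embedding output; index k is the output of block k.
        vectors.append(out.hidden_states[layer][0, -1].float().cpu())
    return torch.stack(vectors)
```

The model_id argument is left for the caller to supply, since the article names the model families but not the exact checkpoints.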

Here's what the benchmarks actually show: linear probes on mid-network layers achieved AUROC scores ranging from 0.904 to 1.000 on held-out splits. Discrimination this strong from a purely linear classifier indicates that a linearly separable truthfulness signal is present in these models. Sampling-based detectors, by contrast, never exceeded 0.541 AUROC, barely above the 0.5 chance level, exposing a clear gap in performance.
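For readers unfamiliar with the terms, a linear probe is just a linear classifier trained on hidden-state vectors, and AUROC is the probability that a randomly chosen truthful example scores higher than a randomly chosen untruthful one. The sketch below illustrates both on toy data. It assumes a logistic-regression probe trained by plain gradient descent and uses synthetic 8-dimensional vectors in place of real hidden states; the study's actual probe and training setup are not described in the article.

```python
import math
import random

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def train_linear_probe(X, y, epochs=50, lr=0.1):
    """Fit logistic-regression weights with per-example gradient steps."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            g = 1.0 / (1.0 + math.exp(-z)) - t
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def probe_scores(X, w, b):
    """Raw linear scores; AUROC only depends on their ordering."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b for x in X]

def make_synthetic(n, dim, rng):
    """Toy stand-in for hidden states: label-1 vectors are shifted along one axis."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += 3.0 * label
        X.append(x)
        y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = make_synthetic(200, 8, rng)
X_test, y_test = make_synthetic(100, 8, rng)   # held-out split
w, b = train_linear_probe(X_train, y_train)
test_auroc = auroc(probe_scores(X_test, w, b), y_test)
print(f"held-out AUROC on toy data: {test_auroc:.3f}")
```

On real data the rows of X would be hidden-state vectors from a chosen layer and y the truthfulness labels from a benchmark.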

The Architecture Matters More Than the Parameter Count

Why the discrepancy? The structure of these models plays an important role. For Llama and Mistral, probe performance peaked around blocks 13 to 18, while Qwen hit its stride between blocks 19 and 25. It seems the architecture's design, rather than just sheer size, influences where and how truthfulness is encoded.
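Finding such a peak amounts to training one probe per block and comparing their held-out scores. As a rough sketch, assuming the per-block AUROC values have already been computed, the hypothetical helper peak_band below returns the best block and the contiguous range of neighboring blocks within a small tolerance of it. The scores in the example are made up for illustration and are not taken from the study.

```python
def peak_band(layer_scores, tolerance=0.01):
    """Return (first, best, last): the best-scoring block index and the
    contiguous band around it whose scores are within `tolerance` of the best."""
    best = max(range(len(layer_scores)), key=lambda i: layer_scores[i])
    threshold = layer_scores[best] - tolerance
    first = last = best
    # Grow the band outward while neighbors stay close to the peak score.
    while first > 0 and layer_scores[first - 1] >= threshold:
        first -= 1
    while last < len(layer_scores) - 1 and layer_scores[last + 1] >= threshold:
        last += 1
    return first, best, last

# Made-up per-block AUROC values, indexed by block number.
example_scores = [0.60, 0.70, 0.90, 0.95, 0.945, 0.80]
first, best, last = peak_band(example_scores)
print(f"best block {best}, band {first} to {last}")
```

Running the same sweep on each model would show whether the band sits earlier or later in the network, which is the kind of difference the study reports between Llama, Mistral, and Qwen.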

Interestingly, first-block attention entropy offered a notable complementary signal, especially in knowledge-grounded settings, achieving between 0.866 and 0.941 AUROC on HaluEval-QA. It also adds no inference cost.
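Attention entropy measures how spread out a block's attention is: low when each token attends to a few positions, high when attention is diffuse. The sketch below shows one way to compute it, assuming the first block's attention weights are available as nested lists indexed by head and query position, and assuming a simple mean over heads and queries. The article does not specify how the study aggregates the entropy, so the helper names and the averaging are illustrative.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution over key positions."""
    return -sum(p * math.log(p) for p in weights if p > 0.0)

def mean_first_block_entropy(attn):
    """attn[head][query] is a probability distribution over keys.
    Returns the entropy averaged over all heads and query positions."""
    rows = [row for head in attn for row in head]
    return sum(attention_entropy(row) for row in rows) / len(rows)

# Toy example: one head, two query positions over four keys.
toy_attn = [[
    [1.0, 0.0, 0.0, 0.0],      # fully focused
    [0.25, 0.25, 0.25, 0.25],  # fully diffuse
]]
print(mean_first_block_entropy(toy_attn))
```

Because the attention weights come from the same forward pass that produces the answer, a feature like this needs no extra sampling or generation.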

Implications for Future Models

What does all this mean for future language models? Simply put, the industry needs to rethink how models are evaluated for truthfulness. Is it time to shift away from sampling methods, which seem structurally misaligned with paired-label evaluations? The low performance of these methods suggests they might not be up to the task.

For developers and researchers, the study offers a new perspective. Probing the right layers and accounting for architectural differences could yield better truthfulness detection than simply scaling up model size.

Ultimately, the numbers tell a different story than sampling methods have suggested. The results reinforce the need for innovation in both model structure and evaluation approaches. With code and data available for full reproducibility, this study paves the way for more targeted advancements in language model truthfulness.