Rethinking LLM Coherence: Why Personality Tests Fall Short

When it comes to understanding the behavioral tendencies of large language models (LLMs), traditional psychometric approaches may not be all they're cracked up to be. Recent studies have tried to use low-cost self-report instruments, such as Big 5 personality questionnaires, to anticipate how these models behave. But the results show a glaring mismatch between what the models report and how they act. The reality is that Big 5 traits aren't great predictors of specific behavior, even for humans.

Big 5 vs. Theory of Planned Behavior

So, what's the alternative? Researchers are turning to the Theory of Planned Behavior (TPB), which better predicts specific human actions. In a series of experiments across four tasks and 11 advanced LLMs, TPB-based self-reports reached human-level coherence with behavior within a single conversation. The Big 5 lagged behind.
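To make "coherence" concrete, here is a minimal sketch of one way to score it, assuming coherence is measured as the Pearson correlation between a model's self-reported scores and its observed behavior scores on the same items. The experiments' actual metric is not given here, so the function name `pearson`, the variable names, and the sample numbers are all illustrative assumptions, not data from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of the two score lists around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five items (made up for illustration).
self_report = [1, 2, 3, 4, 5]   # what the model says it would do
behavior = [2, 4, 6, 8, 10]     # what the model actually does

coherence = pearson(self_report, behavior)
```

Under this assumption, a value near 1 means the self-report tracks behavior closely, and a value near 0 means the self-report says little about what the model will do.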

But here's where it gets interesting. When you break the conversation into separate sessions, coherence doesn't hold up unless the behavior emerges from prior training, like implicit biases. If the behavior is heavily influenced by the immediate context, coherence collapses. This raises a critical question: Are these models just mimicking context, or do they have underlying behavioral patterns?

The Role of Persona Prompting

Some might argue that persona prompting can help. It does make self-reports more consistent across conversations, but it doesn't align behavior with these reports. So, while persona prompting looks promising as a tool for consistency, it's not the fix we need for coherent behavior predictions.
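The distinction between consistent self-reports and aligned behavior can be sketched with two separate measurements. This is an illustrative example only: the helper names `self_report_spread` and `report_behavior_gap`, the 1-to-5 scale, and the numbers are assumptions made up for the sketch, not results or methods from the study.

```python
from statistics import mean, pstdev


def self_report_spread(reports):
    """Standard deviation of self-reports across conversations (lower = more consistent)."""
    return pstdev(reports)


def report_behavior_gap(reports, behaviors):
    """Mean absolute difference between each self-report and the matching behavior score."""
    if len(reports) != len(behaviors):
        raise ValueError("reports and behaviors must have the same length")
    return mean(abs(r - b) for r, b in zip(reports, behaviors))


# Hypothetical scores from four separate conversations on a 1-to-5 scale.
reports = [4.0, 4.0, 4.0, 4.0]     # persona-prompted self-reports: identical every time
behaviors = [1.0, 5.0, 2.0, 4.0]   # observed behavior: varies with context

spread = self_report_spread(reports)        # consistency of what the model says
gap = report_behavior_gap(reports, behaviors)  # distance between saying and doing
```

In this made-up case the spread is zero while the gap stays large, which is the pattern described above: stable self-reports that still fail to predict behavior.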

The takeaway: Broad personality frameworks like the Big 5 might not be the best tools for modeling LLM behavior. Instead, we need more targeted instruments, focused on specific tasks and evaluated across various contexts.

Why This Matters

Why should we care? As LLMs become more integrated into daily applications, knowing how they'll behave isn't just an academic exercise. It's vital for safe deployment. If we can't predict their actions reliably, we risk unforeseen consequences.

The bottom line? We need to prioritize precision in behavioral prediction to ensure these models serve us well. It's time to rethink our approach and invest in methodologies that genuinely capture the nuances of LLM behavior.
