Unpacking Truthfulness in AI Advisors: When Models Overshare
Exploring the honesty of AI in scenarios where user and model objectives conflict. Are large language models revealing more than they should?
In the expanding universe of large language models, there's a growing concern about truthfulness, especially when these models serve as advisors. The issue arises when their objectives don't align with users'. Whether it's a negotiator or a sales assistant, you might expect these AI models to prioritize their own incentives over unvarnished honesty. But how honest are they really, when what's best for them might not be what's best for you?
The Cheap-Talk Model Reimagined
Enter the Crawford-Sobel cheap-talk model, a well-known concept in economic theory, now serving as a benchmark for evaluating how truthful these models remain under misaligned preferences. Think of it this way: if a model knows something that conflicts with its payoff, will it reveal everything, or just enough to nudge the user in the 'right' direction?
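To make the benchmark concrete, here's a minimal sketch of the textbook uniform-quadratic version of the cheap-talk game: the state is uniform on [0, 1], both sides have quadratic loss, and the sender wants an action that is higher than the receiver's ideal by a fixed bias. That specific setup is an assumption for illustration, since the study's exact parameterization isn't given here, and the function names are made up for this example. In this case, the most informative equilibrium splits the state space into intervals that get wider as the state rises, and the sender only says which interval the truth falls in.

```python
from typing import List


def max_intervals(bias: float) -> int:
    """Largest N for which an N-interval partition equilibrium exists.

    In the uniform-quadratic case (an assumption of this sketch), an
    N-interval equilibrium exists only if 2 * N * (N - 1) * bias < 1.
    """
    if bias <= 0:
        raise ValueError("bias must be positive in this sketch")
    n = 1
    # Keep adding intervals while an (n + 1)-interval equilibrium still exists.
    while 2 * (n + 1) * n * bias < 1:
        n += 1
    return n


def partition_boundaries(bias: float) -> List[float]:
    """Cutoffs a_0 = 0 < a_1 < ... < a_N = 1 of the most informative equilibrium.

    Uses the closed form a_i = i / N + 2 * bias * i * (i - N), which makes
    each interval 4 * bias wider than the one before it.
    """
    n = max_intervals(bias)
    return [i / n + 2 * bias * i * (i - n) for i in range(n + 1)]


if __name__ == "__main__":
    for b in (0.01, 0.05, 0.1, 0.25):
        cuts = [round(c, 4) for c in partition_boundaries(b)]
        print(f"bias={b}: {max_intervals(b)} interval(s), cutoffs={cuts}")
```

With a bias of 0.1, this sketch gives just two intervals, split near 0.3, so a strategic sender would say little more than "low" or "high". At a bias of 0.25 or more, only one interval survives and the message carries no information at all.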
In this study, models were tested with varying levels of bias and prompt frames. The idea was to see how much information they'd reveal when honesty battled self-interest. As it turns out, all four tested models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, and Llama-3.3-70B) ended up revealing more than the cheap-talk theory prescribes as optimal. We're talking a 1.8- to 4.2-fold increase in disclosed information, beyond even what the most informative equilibrium would suggest.
A Revelation in Over-Revelation
Here's the thing: as you'd expect, the informativeness of the models' responses went down as bias increased. But the models never dropped to the level of strategic revelation the theory predicts. Instead, they opted for near-complete transparency, with a tendency to exaggerate in the direction of their bias. It's like they couldn't help but spill the beans.
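For a sense of how quickly a strategic sender is supposed to clam up as bias grows, here's a rough sketch using the same textbook uniform-quadratic cheap-talk setup (state uniform on [0, 1], quadratic loss). It measures informativeness by the receiver's leftover uncertainty about the state in the most informative equilibrium. Both the setup and that metric are assumptions chosen for illustration; the study's own measure of disclosed information isn't specified here, and the function names are hypothetical.

```python
def max_intervals(bias: float) -> int:
    """Largest N with 2 * N * (N - 1) * bias < 1 (uniform-quadratic case)."""
    if bias <= 0:
        raise ValueError("bias must be positive in this sketch")
    n = 1
    while 2 * (n + 1) * n * bias < 1:
        n += 1
    return n


def residual_variance(bias: float) -> float:
    """Receiver's expected squared error in the most informative equilibrium.

    Closed form for the uniform-quadratic case:
        1 / (12 * N**2) + bias**2 * (N**2 - 1) / 3
    Lower means the sender's message is more informative.
    """
    n = max_intervals(bias)
    return 1 / (12 * n**2) + bias**2 * (n**2 - 1) / 3


# Variance of a uniform [0, 1] state when the message reveals nothing.
NO_INFORMATION = 1 / 12

if __name__ == "__main__":
    for b in (0.01, 0.05, 0.1, 0.25):
        share = 1 - residual_variance(b) / NO_INFORMATION
        print(f"bias={b}: share of uncertainty removed = {share:.2f}")
```

A fully transparent sender would remove all of the receiver's uncertainty at every bias level, while the equilibrium sender removes less and less as bias rises. The gap between those two is the kind of over-revelation at issue.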
Why does this matter for everyone, not just researchers? Well, if you've ever trained a model, you know that tuning these systems to get the 'right' level of disclosure is important. These findings suggest that current models are too eager to share, potentially giving away more than they should in high-stakes scenarios.
What Does This Mean for AI Ethos?
So, what does all this say about the ethical backbone of AI as advisors? If these models lean towards over-sharing even when it might not serve the user's best interest, how can we trust them to stay aligned with our goals? The analogy I keep coming back to is telling a white lie to spare someone's feelings. At what point does sparing feelings become deceptive?
Looking forward, there's a real challenge in calibrating these models to strike the right balance between honesty and self-interest. One thing's for sure: this isn't just a technical quirk. It's a broader ethical conversation about how we want AI to behave when the stakes are sky-high.