The BESPOKE Benchmark: Personalization in AI Search

Search-augmented large language models (LLMs) are the shiny new tool in the AI toolbox, promising to help us find information faster and smarter. They integrate retrieval with generation, cutting down on the mental gymnastics that traditional search systems demanded. But are they really serving everyone's needs? The truth is, not quite yet.

Introducing BESPOKE

That's where BESPOKE comes in. Think of it as a new standard for evaluating how well these AI models personalize their responses. BESPOKE is built on real chat and search histories from actual humans, not some synthetic dataset cooked up in a lab. It's also diagnostic: it pairs model responses with preference scores and feedback straight from the users. This isn't just theory; it's grounded in real-world interactions.

What's the Big Deal?

Why should you care about BESPOKE? Because it highlights the gap between what we hope AI can do and what it actually delivers. Sure, systems like ChatGPT and Gemini are trying to personalize interactions by using our histories. But without a benchmark like BESPOKE to measure whether they're hitting the mark, these efforts feel like shots in the dark. Can they really understand that two users with the same query could want entirely different things?

BESPOKE's thorough human annotation process offers a reality check. Annotators didn't just toss in random data; they contributed their own histories and authored detailed queries. They then scored responses and provided diagnostic feedback, ensuring the evaluation isn't just surface level.
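
To make that concrete, here is a rough Python sketch of what one annotated record could look like and how the users' own scores might be averaged per model. Everything in it is an assumption for illustration: the class names, the field names, the 1-5 score scale, and the sample data are hypothetical and are not taken from the benchmark itself.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class AnnotatedResponse:
    """One model response judged by the user who wrote the query (hypothetical schema)."""
    model: str
    response: str
    score: int      # assumed 1-5 preference scale, not taken from the benchmark
    feedback: str   # free-text diagnostic feedback from the annotator


@dataclass
class Record:
    """A user's query plus the history that gives it context (hypothetical schema)."""
    user_id: str
    history: list[str]
    query: str
    responses: list[AnnotatedResponse] = field(default_factory=list)


def mean_score_by_model(records: list[Record]) -> dict[str, float]:
    """Average the users' own preference scores for each model."""
    scores: dict[str, list[int]] = {}
    for record in records:
        for r in record.responses:
            scores.setdefault(r.model, []).append(r.score)
    return {model: mean(vals) for model, vals in scores.items()}


# Made-up data: same query, same answer, two users with different histories.
records = [
    Record("u1", ["marathon training plans"], "best shoes",
           [AnnotatedResponse("model_a", "Try cushioned running shoes.", 5,
                              "Matches my training.")]),
    Record("u2", ["office dress code"], "best shoes",
           [AnnotatedResponse("model_a", "Try cushioned running shoes.", 1,
                              "I wanted formal shoes.")]),
]
result = mean_score_by_model(records)
```

In this toy data the identical answer earns a 5 from one user and a 1 from the other, which is the kind of split a generic relevance score would hide and a per-user score with feedback can surface.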

Why It Matters

The potential for personalization in AI is huge. Imagine an AI that not only knows what you're asking but understands why you're asking it. That could transform how we interact with technology. But so far, the gap between the keynote and the cubicle is enormous. Management bought the licenses for these fancy tools, but what does the internal Slack channel really look like? Probably full of questions and complaints about how these tools don't quite get it yet.

If AI is to genuinely enhance our productivity and workflow, it needs to be evaluated on how well it personalizes and adapts to diverse user needs. BESPOKE could provide that much-needed foundation. But the real question is, will companies pay attention? Or will they continue to push tools that haven't quite evolved to meet user needs?