Rethinking Multi-Intent Detection: Beyond Familiar Patterns
Multi-intent detection faces a tougher challenge: recognizing new combinations of known intents. The CoMIX-Shift benchmark reveals how models stack up when tested on unfamiliar ground.
Multi-intent detection isn't just about identifying multiple intents in a single utterance anymore. The real test lies in recognizing new combinations of familiar intents, a challenge that's been largely overlooked until now. Traditional benchmarks often fall short, as they don't stray far from familiar co-occurrence patterns. Enter CoMIX-Shift, a new benchmark designed to push the limits of compositional generalization in multi-intent detection.
Benchmarking the Unseen
CoMIX-Shift introduces a series of stress tests: held-out intent pairs, discourse-pattern shifts, longer and noisier utterances, clause templates that models haven't seen before, and the daunting zero-shot triples. These aren't your standard tests. They're crafted to measure a model's ability to adapt and generalize beyond its training.
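The held-out-pairs condition can be illustrated with a small sketch. Assuming a dataset of (utterance, intent-set) examples, the split below (function and intent names are ours, not from CoMIX-Shift) quarantines chosen intent pairs so they co-occur only at test time, while every individual intent still appears in training:

```python
from itertools import combinations

def held_out_pair_split(examples, held_out_pairs):
    """Route any example containing a held-out intent pair to the
    test set; everything else stays in training."""
    held = {frozenset(p) for p in held_out_pairs}
    train, test = [], []
    for utterance, intents in examples:
        pairs = {frozenset(p) for p in combinations(sorted(intents), 2)}
        (test if pairs & held else train).append((utterance, intents))
    return train, test

# Toy data: both intents appear alone in training, together only in test.
examples = [
    ("book a flight", {"book_flight"}),
    ("play some jazz", {"play_music"}),
    ("book a flight and play some jazz", {"book_flight", "play_music"}),
]
train, test = held_out_pair_split(examples, [("book_flight", "play_music")])
```

A model evaluated this way has to compose intents it has only ever seen in isolation.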
ClauseCompose, a lightweight decoder trained solely on single-intent utterances, outshines the competition. It achieves 95.7% exact match on unseen intent pairs and 93.9% on discourse-shifted pairs. Whole-utterance baselines, including a fine-tuned tiny BERT model, lag well behind, a gap that shows why compositional evaluation belongs in any serious model assessment.
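The article doesn't spell out ClauseCompose's decoding procedure, but the factorization idea it points to (split the utterance into clauses, classify each clause with a single-intent model, take the union) can be sketched. The regex splitter and keyword classifier below are illustrative assumptions, not the actual system:

```python
import re

def split_clauses(utterance):
    # Naive splitter on coordinating connectives; a real system would
    # use a learned or syntactic clause segmenter.
    parts = re.split(r"\band\b|\bthen\b|[,;]", utterance)
    return [p.strip() for p in parts if p.strip()]

def compose_intents(utterance, single_intent_classifier):
    # Classify each clause independently, then union the predictions:
    # the model never needs to have seen this combination in training.
    return {single_intent_classifier(c) for c in split_clauses(utterance)}

# Toy keyword classifier standing in for a trained single-intent model.
def toy_classifier(clause):
    return "book_flight" if "flight" in clause else "play_music"

preds = compose_intents("book a flight and play some jazz", toy_classifier)
# preds == {"book_flight", "play_music"}
```

Because each clause is handled on its own, unseen pairs and triples cost nothing extra, which is exactly what the benchmark rewards.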
Why It Matters
Why is this essential? In the real world, language isn't neatly packaged. People mix and match intents, often in unpredictable ways. For applications from virtual assistants to customer service, accurately deciphering these combinations can make or break the user experience. In short, models need to be as adaptable as the people using them.
Yet many existing models haven't been up to the task. WholeMultiLabel managed a mere 81.4% on unseen intent pairs and an abysmal 0.0% on unseen triples, and even the BERT baseline faltered on zero-shot triples. The lesson is plain: architecture matters more than parameter count when true multi-intent capability is on the line.
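For context on those percentages: exact match in multi-intent work is typically set-level, meaning a prediction scores only if the full intent set is correct (our reading of the metric here, not a detail stated in the article). A minimal implementation:

```python
def exact_match(predictions, golds):
    # A prediction counts only when the whole intent set is right;
    # partial overlap earns nothing, which is what makes the
    # zero-shot-triples condition so punishing.
    hits = sum(set(p) == set(g) for p, g in zip(predictions, golds))
    return hits / len(golds)

score = exact_match(
    [{"a", "b"}, {"a"}, {"a", "b", "c"}],
    [{"a", "b"}, {"b"}, {"a", "b"}],
)
# score == 1/3: only the first prediction matches its gold set exactly
```

Under this all-or-nothing scoring, a model that gets two of three intents right still records a zero, so a 0.0% on triples means no utterance was fully decoded.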
Looking Ahead
So, what's next for multi-intent detection? The reality is, we need more benchmarks like CoMIX-Shift, focused on compositional evaluation. Simple factorization techniques, as shown by ClauseCompose, can go surprisingly far when the evaluation criteria demand it. But can the industry shift its mindset to prioritize these more rigorous tests? Only those willing to adapt will stay ahead.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
BERT: Bidirectional Encoder Representations from Transformers.
Decoder: The part of a neural network that generates output from an internal representation.
Evaluation: The process of measuring how well an AI model performs on its intended task.