GroundAct: Revealing the Gaps in LLMs' Action Understanding

Large Language Models (LLMs) like Qwen2.5-3B have been making waves with their ability to tackle tasks that are spelled out clearly in the instructions. Yet when the feasibility of an action hinges on details of the environment that those instructions never mention, performance takes a nosedive. This isn't just a dip; it's a chasm. Success rates plummet from 85-96% down to a dismal 29-53%. What's going on here?

Introducing GroundAct

The paper's key contribution is GroundAct, a benchmark of 1,500 scenarios and 16,592 task instances. The tasks span 11 domains and are arranged along a hierarchy of cognitive complexity. If we're to push the boundaries of what LLMs can do, understanding their limitations across these scenarios is essential.

Three diagnostic patterns emerged from evaluating 15 different LLMs, ranging from 3 billion to a whopping 671 billion parameters. First, attribute reasoning correlates only weakly with the more complex tool and coordination reasoning. A model that is good at one isn't necessarily good at the other, so each model has its own distinct capability profile. Second, when complete environment graphs are available, tool use outcomes improve by up to 27.6%, yet implicit collaboration suffers significantly, dropping by 22.9%. Lastly, supervised fine-tuning dramatically boosts Qwen2.5-3B's direct command performance from a meager 0.6% to 76.3%. But the model is still stuck in the mud on implicit collaboration, which only nudges up from 1.5% to 5.5%.
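To see how lopsided the fine-tuning result is, here's a quick sketch that turns the before/after figures quoted above into percentage-point gains. Only the four percentages come from the article; the dictionary keys and the function name are made up for illustration.

```python
# Before/after success rates (in percent) for Qwen2.5-3B with supervised
# fine-tuning, as quoted in the text. The key names are illustrative labels.
sft_results = {
    "direct_command": (0.6, 76.3),
    "implicit_collaboration": (1.5, 5.5),
}


def gain_in_points(before: float, after: float) -> float:
    """Absolute change in percentage points, rounded to one decimal."""
    return round(after - before, 1)


gains = {
    task: gain_in_points(before, after)
    for task, (before, after) in sft_results.items()
}

for task, gain in gains.items():
    print(f"{task}: +{gain} percentage points")
```

Put side by side, fine-tuning adds about 76 points on direct commands but only 4 on implicit collaboration.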

The Significance of Action Grounding

Action grounding isn't a trivial challenge that shrinks steadily as models grow. It's a multifaceted issue demanding nuanced solutions. With GroundAct, researchers now have a detailed map of where LLMs falter and where they thrive. This creates the opportunity for targeted improvements rather than relying on sheer size and computational heft.

The key finding here is that simply supersizing models doesn't translate into better performance in every scenario. If LLMs are to operate in dynamic and unpredictable environments, they need more than just massive datasets. They need an inherent understanding of action feasibility.
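To make "action feasibility" concrete, here's a toy sketch of an action whose success depends on environment state the instruction never mentions. It is not GroundAct's task format. The kettle, its attributes, and the function name are all hypothetical, chosen only to illustrate the idea.

```python
# Toy environment state: object name -> attribute flags.
# Everything here is a made-up illustration, not data from the benchmark.
env_state = {
    "kettle": {"plugged_in": True, "has_water": False},
}


def missing_preconditions(state: dict, obj: str, required: list[str]) -> list[str]:
    """Return the required attributes that do not hold for `obj`."""
    attrs = state.get(obj, {})
    return [name for name in required if not attrs.get(name, False)]


# The instruction "boil water in the kettle" says nothing about whether the
# kettle has water in it, yet the action is only feasible if it does.
required = ["plugged_in", "has_water"]
blockers = missing_preconditions(env_state, "kettle", required)

if blockers:
    print("Not feasible, unmet preconditions:", blockers)
else:
    print("Feasible")
```

A model that reads only the instruction has no reason to object. One that grounds the action in the environment would notice the empty kettle first.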

Beyond the Hype

The ablation study reveals where these models trip and where they stride confidently. So, here's the burning question: Are we ready to reconsider how we train and evaluate these models? GroundAct might just be the benchmark to push the boundaries of what's possible with LLMs. It's not just about bigger models but smarter ones.
