Why MLLMs Struggle to Use Physical Tools and What It Means for AI

In the landscape of artificial intelligence, Multimodal Large Language Models (MLLMs) are touted as the brains guiding robots as they navigate our physical world. Yet at something as fundamental as using physical tools, these models are floundering. A recent benchmark, PhysTool-Bench, has laid bare these shortcomings, challenging our expectations of AI's real-world practicality.

The PhysTool-Bench Revelation

PhysTool-Bench is a benchmark designed to put MLLMs through their paces, evaluating their ability to understand real-world scenarios, identify physical tools, and plan their use. It comprises 2,510 queries over 2,678 real-world physical tools across domains such as manufacturing, electrical work, agriculture, and healthcare. The results are sobering: even the most advanced MLLM tested, Gemini-3.1-Pro, identified only 58.7% of tools and solved only 21.0% of queries end-to-end.

Why the Struggle?

So why do these models stumble on such seemingly basic tasks? The problem has two layers. First, MLLMs don't perceive tools effectively in realistic scenes. But the more significant issue surfaces at the planning stage: the models struggle to map the tools they do perceive onto task semantics, revealing a critical lack of functional commonsense. This isn't just a theoretical shortfall; it's a practical bottleneck that limits the development of truly effective embodied AI.
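To make the two layers concrete, here is a minimal sketch of how perception failures and planning failures could be separated when scoring a model. It is an illustration under stated assumptions, not PhysTool-Bench's actual scoring code: the `QueryResult` record, its two fields, and the `stage_rates` function are hypothetical, the data is a toy example, and everything is counted per query for simplicity.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    # Hypothetical per-query record; not the benchmark's real schema.
    tools_identified: bool  # the model recognized the tools the task needs
    plan_correct: bool      # the model's plan for using them was judged correct


def stage_rates(results):
    """Split overall performance into a perception rate and a planning rate."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to score")
    identified = [r for r in results if r.tools_identified]
    # A query counts as solved end-to-end only if both stages succeed.
    solved = [r for r in identified if r.plan_correct]
    return {
        "identification": len(identified) / total,
        "end_to_end": len(solved) / total,
        # Planning success among queries where perception already succeeded.
        "planning_given_identified": (
            len(solved) / len(identified) if identified else 0.0
        ),
    }


# Toy data: perception succeeds on 3 of 4 queries, planning on 1 of those 3.
toy = [
    QueryResult(True, True),
    QueryResult(True, False),
    QueryResult(True, False),
    QueryResult(False, False),
]
rates = stage_rates(toy)
print(rates)
```

In a breakdown like this, a low `planning_given_identified` value alongside a higher `identification` value would point to the planning stage, rather than perception, as the main bottleneck.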

Implications for the Future

This predicament raises an urgent question: how can we expect AI to meaningfully assist in complex environments if it can't master the basics of tool use? The gap revealed by PhysTool-Bench isn't just a technical challenge but a barrier to the transformative potential of AI in industries like healthcare and agriculture, where practical tool use is essential.

To enjoy AI, you'll have to enjoy failure too. The journey of AI isn't a straight line, and setbacks like these force us to confront its limitations with a renewed sense of urgency. They also create an opportunity: a clear need for innovation and improvement in how AI models are trained to interact with the physical world.

As researchers and developers continue to push boundaries, the focus should increasingly turn towards embedding a deeper understanding of context and functionality into these models. The proof of concept is the survival. If AI is to become the helper it promises to be, overcoming these hurdles isn't just important; it's necessary.
