Grounding Robots: Why Vision-Language Models Aren't There Yet

Safe human-robot collaboration hinges on more than just pretty visuals. It requires a system that can tell whether a robot is keeping a safe distance, is already in contact, or is about to collide. This capability is called collision grounding: linking what a robot sees to its body geometry, the camera's viewpoint, the scene, and human proximity in order to predict contact.

The Benchmark: TouchSafeBench

Enter TouchSafeBench, a physics-grounded benchmark designed to test collision grounding in vision-language models (VLMs). Developed within Habitat 3.0, it offers a rich dataset of 2,940 simulated indoor episodes showcasing social navigation and rearrangement. The episodes provide synchronized multi-view RGB-D observations, trajectory maps, and contact labels derived from the simulator.

TouchSafeBench targets two tasks that matter for deployment: classifying the robot's current safety state and predicting imminent collisions before they occur. The findings are disheartening. Across three frontier VLMs and nine visual representations, the best average Macro-F1 score stays below 50%. This isn't just a numbers game. It's a wake-up call.
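To see why Macro-F1 is a demanding yardstick here, consider a minimal sketch of the metric in plain Python. The label names and the toy predictions are assumptions made for illustration, not the benchmark's actual label set or data. Macro-F1 averages the per-class F1 scores with equal weight, so a model that always answers "safe" can look accurate while scoring poorly.

```python
from collections import Counter


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never appears.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical label names for the three safety states described in the article.
LABELS = ["safe", "imminent", "contact"]

# Toy data: mostly safe frames, and a model that always predicts "safe".
y_true = ["safe"] * 8 + ["imminent", "contact"]
y_pred = ["safe"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, LABELS)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

On this toy data, accuracy is 0.8 while Macro-F1 is roughly 0.30, because the two rare classes each contribute an F1 of zero. That is the kind of failure a Macro-F1 score below 50% can hide behind fluent-looking answers.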

Challenges and Insights

Why are VLMs struggling? Giving a model explicit depth data does not automatically translate into concrete evidence of robot-body collisions. Moreover, determining robot-scene contact is consistently harder than gauging human-contact risk. TouchSafeBench highlights a critical flaw: visual fluency doesn't equate to physical accountability.
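A small geometric sketch hints at why depth alone falls short. It assumes a pinhole camera with known intrinsics and approximates the robot body as a single sphere in the camera frame. Both are simplifications chosen for illustration, and none of the names or numbers below come from TouchSafeBench.

```python
import math


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth into the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def clearance(point, body_center, body_radius):
    """Distance from a 3D point to the surface of a sphere standing in for the robot body."""
    return math.dist(point, body_center) - body_radius


# Hypothetical intrinsics and one depth reading at the image center.
obstacle = backproject(u=320, v=240, depth_m=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# The same depth reading, paired with two assumed robot positions in the camera frame.
near = clearance(obstacle, body_center=(0.0, 0.0, 1.5), body_radius=0.3)
far = clearance(obstacle, body_center=(0.0, 0.0, 0.5), body_radius=0.3)
print(f"near={near:.2f} m, far={far:.2f} m")
```

The identical depth value yields a clearance of about 0.2 m in one case and about 1.2 m in the other. The depth map only says where the obstacle is relative to the camera. Turning that into contact evidence also requires the viewpoint and the robot's own geometry.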

This gap raises a pertinent question: can we trust these models in critical scenarios if they can't reliably predict contact? The answer, for now, seems to be no.

Looking Forward: Building Better Models

For reliable safety, future models must explicitly integrate viewpoint, robot morphology, metric geometry, and predictive collision analysis. This isn't an overstatement. It's the very foundation for ensuring safety in human-robot interactions.
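As one illustration of what predictive collision analysis can look like, here is a minimal time-to-contact sketch. It assumes constant relative velocity and a single contact radius covering robot and obstacle together. This is a textbook simplification, not a method from TouchSafeBench, and the 2-second horizon is a made-up threshold.

```python
import math


def time_to_contact(rel_pos, rel_vel, contact_radius):
    """Earliest t >= 0 with |rel_pos + t * rel_vel| == contact_radius, or None if never."""
    a = sum(v * v for v in rel_vel)
    b = 2 * sum(p * v for p, v in zip(rel_pos, rel_vel))
    c = sum(p * p for p in rel_pos) - contact_radius ** 2
    if c <= 0:
        return 0.0  # already within the contact radius
    if a == 0:
        return None  # no relative motion, so the gap never closes
    disc = b * b - 4 * a * c
    if disc < 0 or b >= 0:
        return None  # the path misses the contact radius, or the two are moving apart
    return (-b - math.sqrt(disc)) / (2 * a)


HORIZON_S = 2.0  # hypothetical look-ahead window for flagging an imminent collision

# Obstacle 2 m ahead, closing at 1 m/s, with a combined contact radius of 0.5 m.
approaching = time_to_contact((2.0, 0.0), (-1.0, 0.0), 0.5)
# Same geometry, but the obstacle is moving away.
receding = time_to_contact((2.0, 0.0), (1.0, 0.0), 0.5)

imminent = approaching is not None and approaching <= HORIZON_S
print(approaching, receding, imminent)
```

In the approaching case the gap closes after 1.5 seconds, inside the assumed horizon, while the receding case never reaches contact. Note that every input to this calculation is metric and expressed in a shared frame, which is exactly the information the article says future models need to handle explicitly.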

TouchSafeBench will be released once it passes peer review. The benchmark is a step in the right direction, urging developers to improve VLMs until they truly bridge the gap between visual understanding and physical interaction. It's not just a challenge. It's a necessity.