PERIA: Advancing Spatial Reasoning in Vision-Language Models

Vision-language models keep getting better at understanding and interacting with the world, yet they often fall short on spatial reasoning, an area that demands active engagement and dynamic interaction with visual data. This gap has led to the development of the PERception-Interaction-reason Agent, or PERIA, which aims to close it.

Understanding PERIA's Approach

PERIA isn't your average visual agent. It uses a tool-augmented framework to tackle spatial reasoning tasks, which range from map reasoning to visual probing and vision reconstruction. At the core of PERIA's method are two tool families. Vision perception tools reveal textual, symbolic, and spatial information, while vision interaction tools alter the visual context, trace paths, and verify spatial relations.
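To make the two tool families concrete, here is a minimal Python sketch of how an agent might register and dispatch tools by family. Everything in it is an illustrative assumption: the names `Tool`, `register`, `call_tool`, `read_text`, and `trace_path`, as well as the toy `state` dictionary, are hypothetical and are not taken from PERIA's actual tool set or interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Tool:
    name: str
    family: str  # "perception" or "interaction" (assumed labels)
    run: Callable[[dict], dict]


# Hypothetical registry mapping tool names to tools.
REGISTRY: Dict[str, Tool] = {}


def register(name: str, family: str):
    """Decorator that adds a function to the registry under a family."""
    def wrap(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        REGISTRY[name] = Tool(name, family, fn)
        return fn
    return wrap


@register("read_text", "perception")
def read_text(state: dict) -> dict:
    # Stand-in for a perception tool: returns labels already stored
    # in the toy state instead of running real OCR on an image.
    return {"labels": sorted(state["labels"])}


@register("trace_path", "interaction")
def trace_path(state: dict) -> dict:
    # Stand-in for an interaction tool: walks the waypoints and
    # reports the total Manhattan length of the traced path.
    pts = state["waypoints"]
    length = sum(
        abs(x2 - x1) + abs(y2 - y1)
        for (x1, y1), (x2, y2) in zip(pts, pts[1:])
    )
    return {"path_length": length}


def call_tool(name: str, state: dict) -> dict:
    """Dispatch a tool call and tag the observation with its family."""
    tool = REGISTRY[name]
    return {"tool": tool.name, "family": tool.family, "result": tool.run(state)}


if __name__ == "__main__":
    state = {"labels": {"Exit", "Cafe"}, "waypoints": [(0, 0), (0, 3), (4, 3)]}
    print(call_tool("read_text", state))
    print(call_tool("trace_path", state))
```

In a real agent the model would choose which tool to call at each step and read the returned observation before deciding what to do next; the sketch only shows the dispatch layer.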

Training such an agent requires an equally careful strategy. PERIA employs a unified recipe that combines supervised tool-use trajectory synthesis with composite rewards. It also adopts a training method called Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO). Together, these refine its tool-use behavior and improve its spatial reasoning performance.
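The article does not spell out how the composite rewards are built or how OR-GIGPO computes its updates, so the following Python sketch is only a generic illustration of the two ideas named above. It assumes a reward that is a weighted sum of three hypothetical signals (answer correctness, output format, valid tool calls) and a plain group-relative normalization of rewards across several sampled trajectories for the same question. The weights, the signal names, and the normalization are assumptions for illustration, not PERIA's actual recipe and not OR-GIGPO itself.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def composite_reward(
    answer_correct: bool,
    format_ok: bool,
    tool_calls_valid: bool,
    weights: Sequence[float] = (1.0, 0.2, 0.3),  # assumed weights
) -> float:
    """Weighted sum of three hypothetical reward signals."""
    w_ans, w_fmt, w_tool = weights
    return (
        w_ans * float(answer_correct)
        + w_fmt * float(format_ok)
        + w_tool * float(tool_calls_valid)
    )


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its group.

    This is a generic group-based baseline, used here as a stand-in;
    the source does not describe OR-GIGPO's exact formulation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled trajectories for the same question (toy values).
    group = [
        composite_reward(True, True, True),
        composite_reward(False, True, True),
        composite_reward(False, True, True),
        composite_reward(True, True, True),
    ]
    print(group_relative_advantages(group))
```

Trajectories that score above their group's average receive a positive advantage and are reinforced, while those below it are discouraged; if every trajectory in a group scores the same, all advantages are zero and the group contributes no learning signal.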

The Results Speak Volumes

When we look at the numbers, PERIA’s impact becomes clear. Testing across 13 benchmarks from 8 distinct datasets revealed a 10.0% improvement over the Qwen3-8B backbone in in-distribution benchmarks and a 4.4% leap in out-of-distribution scenarios. More impressively, PERIA surpasses previous leading models of its size, with gains ranging from 7.0% to 14.8%. It even competes head-to-head with larger models like Qwen3-VL-235B-A22B-Thinking and GPT-5.

Why Does This Matter?

The implications of PERIA's success are profound for the future of AI. While the industry often chases bigger models with larger datasets, PERIA underscores an alternative path: the integration of specialized tools to amplify model capabilities without escalating size. Isn't it time we asked if we're focusing too much on model size rather than functionality?

This approach not only challenges the status quo but also raises an important question about the future of AI development: should we prioritize tool augmentation over mere scale? The success of PERIA suggests that perhaps we should. It aligns with the principle that the quality of tools and data integration can supersede sheer model magnitude.
