Reasmory: The Future of Spatial Reasoning in Vision-Language Models
Reasmory is changing how vision-language models handle spatial reasoning. By running structured programs over an explicit 3D memory, it posts gains over strong baselines.
Vision-Language Models (VLMs) have been making strides, but they're not quite there yet on spatial reasoning. Tasks like viewpoint reasoning and distance estimation trip them up. Why? Because spatial cues are scattered across views and hard to pin down.
The Problem with Current Models
Existing models often fail to bring together the scattered spatial data found in multi-view images and videos. The result? They struggle with accurate spatial understanding. Even when reconstruction-based Vision Foundation Models (VFMs) are brought in to help, the results often fall short. And exposing models to free-form tools without guidance leads to missteps.
Enter Reasmory
Reasmory takes a different route. It frames spatial reasoning as structured program execution over 3D memory. How? It constructs an explicit 3D memory and enriches it with 3D object instances. Plus, it introduces a Domain-Specific Language (DSL) that limits the model to structured, validated operations. This means no more wild guesses.
The results back it up. Reasmory shows gains of 6-18% over strong baselines like GPT-5-mini and Gemini-3-flash. That's no small feat. It suggests that structured, validated operations beat free-form tool use for spatial reasoning.
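To make the idea of validated operations over an explicit 3D memory concrete, here is a minimal sketch. It is not Reasmory's actual DSL or memory format, which this article does not describe. The operation names (locate, distance), the ObjectInstance record, the object coordinates and the tuple-based program format are all hypothetical, chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ObjectInstance:
    """Hypothetical 3D memory entry: an object and its center in a shared world frame."""
    name: str
    center: Tuple[float, float, float]  # metres (made-up values below)


# Hypothetical 3D memory, as if fused from several views of one scene.
MEMORY = {
    "chair": ObjectInstance("chair", (1.0, 0.0, 2.0)),
    "table": ObjectInstance("table", (4.0, 0.0, 6.0)),
}

# Hypothetical DSL: op name -> (implementation, argument kinds, result kind).
OPS = {
    "locate": (lambda mem, name: mem[name].center, ("object",), "point"),
    "distance": (lambda mem, a, b: math.dist(a, b), ("point", "point"), "scalar"),
}


def validate(program, memory):
    """Reject a program before execution if any step is malformed."""
    if not program:
        raise ValueError("empty program")
    var_kinds = {}  # variable name -> kind of value it will hold
    for target, op, args in program:
        if op not in OPS:
            raise ValueError(f"unknown operation: {op}")
        _, arg_kinds, result_kind = OPS[op]
        if len(args) != len(arg_kinds):
            raise ValueError(f"{op} expects {len(arg_kinds)} argument(s)")
        for arg, kind in zip(args, arg_kinds):
            if kind == "object":
                if arg not in memory:
                    raise ValueError(f"unknown object: {arg}")
            elif var_kinds.get(arg) != kind:
                raise ValueError(f"{arg} is not a defined {kind}")
        var_kinds[target] = result_kind


def run(program, memory):
    """Validate, then execute each step in order; return the last step's value."""
    validate(program, memory)
    env = {}
    for target, op, args in program:
        fn, arg_kinds, _ = OPS[op]
        # Object arguments are passed by name; other arguments are looked up.
        values = [a if k == "object" else env[a] for a, k in zip(args, arg_kinds)]
        env[target] = fn(memory, *values)
    return env[program[-1][0]]


# A program a model might emit for "how far is the chair from the table?"
PROGRAM = [
    ("a", "locate", ("chair",)),
    ("b", "locate", ("table",)),
    ("d", "distance", ("a", "b")),
]

print(run(PROGRAM, MEMORY))  # 5.0
```

The point of the sketch is the ordering: a program that names an unknown object, calls an undefined operation or passes the wrong kind of value is rejected before anything runs, which is what separates this style from free-form tool use.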
Why Does This Matter?
This changes the landscape. VLMs with enhanced spatial understanding can revolutionize how we interact with technology. From autonomous vehicles to AR experiences, the potential is massive. But here's the kicker: Will other labs catch up, or is Reasmory setting a new standard everyone else will scramble to meet?
If the gains hold up, the leaderboard shifts. Reasmory isn't just a step forward. It could be the leap the industry needs.
Key Terms Explained
Gemini: Google's flagship multimodal AI model family, developed by Google DeepMind.
GPT: Generative Pre-trained Transformer.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.
Tool use: The ability of AI models to interact with external tools and systems, such as browsing the web, running code, querying APIs, and reading files.