CrossVLA: A New Contender in Vision-Language-Action Models

By Callum Bryce · June 9, 2026

CrossVLA challenges the dominance of autoregressive VLAs with a reported 10.4 percentage point gain and a hard look at what caching really costs.

JUST IN: A new study is shaking up the Vision-Language-Action (VLA) model scene. Meet CrossVLA, a bold step forward that’s making everyone rethink their strategies.

Breaking the Mold

Vision-Language-Action models have typically stuck to one of two tried-and-tested patterns: discrete-token autoregression or continuous-action flow matching. But CrossVLA is flipping the script. It introduces a surrogate flow-matching log-probability estimator, which lets Direct Preference Optimisation (DPO) work on continuous-action systems without complicated probability-flow integration. That's a mouthful, but the point is simple: DPO compares the log-probabilities of preferred and rejected actions, and flow-matching policies don't hand those over cheaply. A surrogate sidesteps that cost.
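To make the idea concrete, here is a minimal sketch of DPO driven by a surrogate log-probability. The study's actual estimator is not described in this article, so the surrogate below (the negative mean squared flow-matching error) is purely an assumption for illustration, and the names surrogate_logp and dpo_loss are hypothetical. Only the DPO objective itself is the standard formulation.

```python
import numpy as np

def surrogate_logp(pred_velocity, target_velocity):
    # ASSUMPTION: stand-in surrogate, not the study's estimator.
    # A lower flow-matching error is treated as a higher "log-probability".
    return -np.mean((pred_velocity - target_velocity) ** 2, axis=-1)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Standard DPO objective: -log sigmoid(beta * (policy margin - reference margin)),
    # where w is the preferred (chosen) sample and l is the rejected one.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -margin)))

if __name__ == "__main__":
    # Toy numbers: the policy scores the chosen action higher than the reference does.
    loss = dpo_loss(
        logp_w=np.array([0.0]), logp_l=np.array([-5.0]),
        ref_logp_w=np.array([-1.0]), ref_logp_l=np.array([-1.0]),
    )
    print(f"DPO loss: {loss:.3f}")
```

The appeal of any surrogate of this kind is that it needs only a forward pass per sample, rather than an integration along the full probability-flow trajectory.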

Performance Leap

The numbers back it up: CrossVLA isn’t just talk. In a head-to-head comparison of parameter-efficient layers, DoRA outshines LoRA, delivering a mean 10.4 percentage point improvement over OpenVLA Soft Fine-Tuning. Just like that, the leaderboard shifts. Across 600 runs and three seeds, DoRA's average gains were +20.0 points on Object tasks, +11.0 on Long-horizon, +8.0 on Goal, and +2.7 on Spatial. Those four figures average out to the 10.4-point headline number. The Object result also came with zero variance. How often do you see that?
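A quick check shows how the per-category gains relate to the headline figure. This sketch assumes the headline is an unweighted mean over the four task categories, which the article does not state explicitly.

```python
# Per-category gains in percentage points, as reported in the article.
gains = {"Object": 20.0, "Long-horizon": 11.0, "Goal": 8.0, "Spatial": 2.7}

# ASSUMPTION: the headline number is a plain unweighted mean of the four categories.
mean_gain = sum(gains.values()) / len(gains)

print(f"Mean gain: {mean_gain:.1f} percentage points")  # prints "Mean gain: 10.4 percentage points"
```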

The Latency Conundrum

Now, let's talk latency. CrossVLA’s denoise loop accounts for a whopping 78.6% of sample_actions latency. It’s a bottleneck, sure, but it’s also where the magic happens. On the caching front, new VLA-Cache strategies top out at a 21% speedup, and they can drag success rates down to anywhere from 0% to 80%. Is this the price of speed?
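Why does that 78.6% share matter so much? Amdahl's law gives a rough feel for it. The sketch below applies the law to the reported fraction. The local speedup factors are hypothetical values chosen for illustration, not measurements from the study, and the function name overall_speedup is ours.

```python
def overall_speedup(fraction, local_speedup):
    # Amdahl's law: only `fraction` of the total time gets faster by `local_speedup`.
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

DENOISE_FRACTION = 0.786  # share of sample_actions latency reported for the denoise loop

if __name__ == "__main__":
    # HYPOTHETICAL speedups of the denoise loop alone.
    for s in (1.5, 2.0, 4.0):
        print(f"denoise loop {s:.1f}x faster -> end-to-end {overall_speedup(DENOISE_FRACTION, s):.2f}x")

    # Upper bound if the denoise loop cost nothing at all.
    print(f"ceiling: {1.0 / (1.0 - DENOISE_FRACTION):.2f}x")
```

The flip side is that optimising anything outside the loop can only touch the remaining 21.4% of the latency, so its payoff is small by construction.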

A Glimpse Ahead

In a striking move, the team pretrained a multi-view and temporal projection head on 6,000 LIBERO frames, achieving an almost perfect 99.5% k-NN recall for same-task retrieval. That’s 36 times better than random chance, which puts the chance rate at roughly 2.8%. And it’s all available for downstream use. The labs are scrambling to integrate these insights.
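For readers unfamiliar with the metric, here is a minimal sketch of one common way to compute same-task k-NN recall over frame embeddings. The article does not give the study's exact definition, so the choices here (cosine similarity, excluding the query itself, counting a hit if any of the k neighbours shares the task) are assumptions, and the toy data is synthetic.

```python
import numpy as np

def knn_recall(embeddings, task_ids, k=1):
    # ASSUMPTION: a query counts as a hit if any of its k nearest
    # neighbours (by cosine similarity, excluding itself) has the same task id.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)            # never retrieve the query itself
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    hits = (task_ids[neighbours] == task_ids[:, None]).any(axis=1)
    return float(hits.mean())

def toy_data(seed=0):
    # Synthetic stand-in for projected frames: three tight, well-separated clusters.
    rng = np.random.default_rng(seed)
    task_ids = np.repeat(np.arange(3), 5)
    centres = np.eye(4)[:3]
    embeddings = centres[task_ids] + 0.01 * rng.standard_normal((15, 4))
    return embeddings, task_ids

if __name__ == "__main__":
    emb, ids = toy_data()
    print(f"recall@1 on toy data: {knn_recall(emb, ids, k=1):.3f}")
```

A projection head that scores well on this metric has learned to place frames from the same task close together, which is what makes it useful for downstream retrieval.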

This could change how VLA models get built. CrossVLA isn’t just another entry. It’s a powerful contender that challenges the status quo. Will it redefine what’s possible? Time for the big names in AI to take note.
