CrossVLA: A New Contender in Vision-Language-Action Models
CrossVLA breaks new ground by challenging the dominance of autoregressive VLAs. With a mean gain of 10.4 percentage points from DoRA fine-tuning and a hard look at what caching really costs, it's one to watch.
JUST IN: A new study is shaking up the Vision-Language-Action (VLA) model scene. Meet CrossVLA, a bold step forward that’s making everyone rethink their strategies.
Breaking the Mold
Vision-Language-Action models have typically stuck to a couple of tried-and-tested patterns: discrete-token autoregression and continuous-action flow-matching. But CrossVLA is flipping the script. It introduces a surrogate flow-matching log-probability estimator, which lets Direct Preference Optimization (DPO) run on continuous-action systems without complicated probability-flow integration. That's a mouthful, but the upshot is simple: preference tuning for flow-matching policies without the expensive likelihood computation.
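For readers who want to see the shape of the idea, here is a minimal sketch of a DPO loss fed by a surrogate log-probability. The article does not spell out CrossVLA's estimator, so the stand-in used here (the negative squared flow-matching error) is an assumption made purely for illustration, as are the function names, the toy numbers, and the beta value.

```python
import math

def surrogate_logp(predicted_velocity, target_velocity):
    """ASSUMED stand-in, not CrossVLA's estimator: score a sample by the
    negative squared flow-matching error, so a better fit scores higher."""
    return -sum((p - t) ** 2 for p, t in zip(predicted_velocity, target_velocity))

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective: -log(sigmoid(margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable evaluation of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Toy numbers (hypothetical): target velocities for a preferred and a rejected action.
chosen_target = [1.0, 0.0]
rejected_target = [0.0, 1.0]

# The policy fits the preferred action well and the rejected action badly.
policy_chosen = surrogate_logp([0.9, 0.1], chosen_target)
policy_rejected = surrogate_logp([0.8, 0.2], rejected_target)

# The frozen reference model fits both equally.
ref_chosen = surrogate_logp([0.5, 0.5], chosen_target)
ref_rejected = surrogate_logp([0.5, 0.5], rejected_target)

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
print(f"DPO loss: {loss:.4f}")
```

The point of the sketch is that nothing in dpo_loss needs an exact likelihood. Any score that rises when the model fits an action better can be plugged in, which is what makes a surrogate attractive for continuous actions.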
Performance Leap
CrossVLA isn't just talk. In a head-to-head comparison of parameter-efficient layers, DoRA outshines LoRA, delivering a mean improvement of 10.4 percentage points over OpenVLA Soft Fine-Tuning. Just like that, the leaderboard shifts. In a massive trial of 600 runs across three seeds, DoRA's average gains were +20.0 on Object tasks, +11.0 on Long-horizon, +8.0 on Goal, and +2.7 on Spatial tasks, which average out to that 10.4-point mean. That's a wild leap forward, with zero variance in the Object category. How often do you see that?
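If the LoRA versus DoRA distinction is new to you, the sketch below shows the difference in plain Python. It follows the published DoRA formulation (a learned per-column magnitude times a normalized direction, with the direction updated by a LoRA-style low-rank term). It is not CrossVLA's code, and the matrix sizes and values are made up for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_norms(w):
    """Euclidean norm of each column of w."""
    return [math.sqrt(sum(row[j] ** 2 for row in w)) for j in range(len(w[0]))]

def lora_weight(w0, b, a):
    """LoRA: frozen weight plus a low-rank update, W = W0 + B A."""
    delta = matmul(b, a)
    return [[x + d for x, d in zip(rw, rd)] for rw, rd in zip(w0, delta)]

def dora_weight(w0, b, a, magnitude):
    """DoRA: rescale each column of (W0 + B A) to unit norm, then multiply
    it by a separately learned magnitude for that column."""
    v = lora_weight(w0, b, a)
    norms = column_norms(v)
    return [[magnitude[j] * v[i][j] / norms[j] for j in range(len(norms))]
            for i in range(len(v))]

# Hypothetical 2x2 layer with a rank-1 adapter.
w0 = [[3.0, 0.0], [4.0, 2.0]]
b = [[0.0], [0.0]]            # B starts at zero, as in LoRA
a = [[1.0, 1.0]]
magnitude = column_norms(w0)  # magnitude starts at the column norms of W0

print(dora_weight(w0, b, a, magnitude))  # equals w0 before any training
```

At initialization both variants reproduce the frozen weight. The difference shows up in training: DoRA can change how large a column is and where it points independently, while LoRA has to do both through the same low-rank update.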
The Latency Conundrum
Now, let's talk latency. CrossVLA's denoise loop accounts for a whopping 78.6% of sample_actions latency. It's a bottleneck, sure, but it's also where the magic happens. On the caching front, new VLA-Cache strategies top out at a 21% speedup, and they can drag success rates down to anywhere between 0% and 80%. Is this the price of speed?
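To put the 78.6% figure in context, Amdahl's law gives the end-to-end speedup you get from accelerating only one part of a pipeline. The sketch below applies it to the denoise-loop share quoted above. The function name and the example speedup factors are illustrative, not taken from the study.

```python
def end_to_end_speedup(fraction, local_speedup):
    """Amdahl's law: overall speedup when `fraction` of the runtime
    is made `local_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

DENOISE_SHARE = 0.786  # share of sample_actions latency quoted in the article

# Doubling the speed of the denoise loop alone.
print(f"{end_to_end_speedup(DENOISE_SHARE, 2.0):.2f}x")           # 1.65x

# Upper bound: the denoise loop becomes free.
print(f"{end_to_end_speedup(DENOISE_SHARE, float('inf')):.2f}x")  # 4.67x
```

In other words, even a perfect optimization of the denoise loop caps out a little under 5x, and everything outside the loop sets that ceiling.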
A Glimpse Ahead
In a striking move, the team pretrained a multi-view and temporal projection head on 6,000 LIBERO frames, achieving an almost perfect 99.5% k-NN recall for same-task retrieval. That's 36 times better than random chance, which implies a chance level of roughly 2.8%. And it's all available for downstream use. Expect other labs to take a close look.
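Here is a rough sketch of how a same-task k-NN recall number can be computed from frame embeddings. The article does not give the exact metric definition, so this assumes a query counts as a hit when at least one of its k nearest neighbours (by cosine similarity, excluding itself) comes from the same task. The embeddings and task labels are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def same_task_recall_at_k(embeddings, task_ids, k=1):
    """Fraction of queries whose k nearest neighbours include a same-task frame.
    ASSUMED definition; the study's exact metric may differ."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Rank every other frame by similarity to the query, best first.
        ranked = sorted(
            ((cosine(query, other), j) for j, other in enumerate(embeddings) if j != i),
            reverse=True,
        )
        if any(task_ids[j] == task_ids[i] for _, j in ranked[:k]):
            hits += 1
    return hits / len(embeddings)

# Toy data: two well-separated tasks, two frames each.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
task_ids = [0, 0, 1, 1]

print(same_task_recall_at_k(embeddings, task_ids, k=1))  # 1.0
```

A projection head that clusters frames by task will push this number toward 1.0, while embeddings that ignore task identity will hover near the chance rate.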
This could change how VLA models get built. CrossVLA isn't just another entry. It's a powerful contender that challenges the status quo. Will it redefine what's possible? Time for the big names in AI to take note.
Key Terms Explained
DPO: Direct Preference Optimization.
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
LoRA: Low-Rank Adaptation.
Parameter: A value the model learns during training, specifically the weights and biases in neural network layers.