Rethinking Diffusion in Vision-Language-Action Models

Diffusion-based vision-language-action (VLA) models are often viewed through the lens of image generation, where outputs are produced through a series of iterative denoising steps. That perspective may not hold up for VLA models. In particular, the claim that sophisticated one-step methods are essential for effective action generation looks weak once we consider the condition-target structure specific to VLA.

Challenging the Status Quo

VLA models are conditioned on a rich mixture of observations, language inputs, and state information. Yet they output something far more compact: a low-dimensional action chunk. So why borrow complex one-step techniques from image synthesis here? The researchers propose a more straightforward approach: keep standard velocity prediction and skip teacher models, distillation stages, and auxiliary objectives.

The key change is simple: bias the time distribution sampled during training toward high-noise states. The approach held up in controlled experiments on an MNIST grid-to-sequence task, and again in extensive robot-policy experiments across various LIBERO settings.
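To make the idea concrete, here is a minimal sketch of building velocity-prediction training pairs with a time distribution skewed toward high noise. Several choices are assumptions made for illustration and are not taken from the paper: the convention that t = 1 is pure noise and t = 0 is the clean action, the linear interpolation path, the Beta(alpha, 1) sampler, and the function names and array shapes.

```python
import numpy as np

def sample_high_noise_times(rng, batch_size, alpha=3.0):
    """Draw t in [0, 1] from Beta(alpha, 1).

    Assumption: t = 1 means pure noise, so alpha > 1 puts more
    probability mass on high-noise states. alpha = 1 is uniform.
    """
    return rng.beta(alpha, 1.0, size=batch_size)

def velocity_training_pair(rng, actions, alpha=3.0):
    """Build (x_t, t, v_target) for action chunks of shape (B, H, D).

    Assumed path: x_t = (1 - t) * action + t * noise, whose velocity
    d x_t / d t is simply noise - action.
    """
    t = sample_high_noise_times(rng, actions.shape[0], alpha)
    noise = rng.standard_normal(actions.shape)
    t_b = t.reshape(-1, 1, 1)          # broadcast over horizon and action dims
    x_t = (1.0 - t_b) * actions + t_b * noise
    v_target = noise - actions         # regression target for the action head
    return x_t, t, v_target

# Hypothetical batch: 4096 chunks, horizon 8, 7-dimensional actions.
rng = np.random.default_rng(0)
actions = rng.standard_normal((4096, 8, 7))
x_t, t, v_target = velocity_training_pair(rng, actions)
```

In a real training loop, a network would take x_t, t, and the conditioning inputs and be regressed onto v_target; only the sampler for t differs from a uniform-time setup.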

Results That Speak Volumes

On standard LIBERO datasets, one-step policies trained with this high-noise-biased schedule matched, and sometimes surpassed, ten-step decoding. Real-robot evaluations on a bimanual YAM RSS setup offer a small-sample, cross-architecture confirmation of the trend. These findings suggest that the multi-step decoding inherited from image generation may not be necessary for action generation in VLA models.
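For readers unfamiliar with the comparison, the difference between one-step and ten-step decoding can be sketched as the number of Euler steps used to integrate the predicted velocity from noise to an action. This is an illustrative sketch, not the paper's code: the t = 1 (noise) to t = 0 (action) convention, the uniform step size, and the closed-form toy_velocity field (a stand-in for a trained network that only covers the trivial case of a single target action) are all assumptions.

```python
import numpy as np

def euler_decode(velocity_fn, noise, num_steps):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (action)."""
    x = noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt               # current time; never reaches 0 inside the loop
        x = x - dt * velocity_fn(x, t)
    return x

# Toy setup: one fixed target chunk (horizon 8, 7-dimensional actions).
rng = np.random.default_rng(0)
target = rng.standard_normal((8, 7))

def toy_velocity(x, t):
    """Exact velocity for the path x_t = (1 - t) * target + t * noise."""
    return (x - target) / t

noise = rng.standard_normal(target.shape)
one_step = euler_decode(toy_velocity, noise, num_steps=1)
ten_step = euler_decode(toy_velocity, noise, num_steps=10)
```

With a learned network, one_step costs a single forward pass of the action head and ten_step costs ten, which is why matching accuracy at one step matters for control latency.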

A standout result comes from a 1.4B VLM with a 30M action head, where one-step decoding reached 95.6% on LIBERO-Long. Results like this indicate that solid one-step VLA action generation can arise from conventional diffusion training, without importing intricate machinery designed for image synthesis.

What's Next for VLA Models?

This could well be a turning point in how we approach action generation in VLA settings. Why complicate things when simplicity works this well? Perhaps it's time to rethink the methodologies we take for granted. Do we really need to import complexity from other fields when simpler, tailored solutions exist?

As the field evolves, it's essential to keep questioning established norms and exploring avenues that might offer more efficient paths forward. The path of least resistance could very well be the path of greatest success.
