Can Vision-Language Models Really Predict Future States?

Unified vision-language models (VLMs) are often lauded for their potential to revolutionize artificial intelligence, but when it comes to predicting future states of image sequences from language instructions, they seem to hit a wall. These models struggle to create transitions between frames that are not only visually coherent but also physically plausible.

The Challenge of Forward Dynamics

The task is called forward dynamics prediction (FDP). It involves anticipating the next frame in an image sequence after a particular action, described in language, is taken. VLMs currently find this challenging, raising the question: are we asking too much of these models too soon?

Part of the difficulty is data complexity. The models need to deal with the intricacies of both image and language data simultaneously. What VLMs are missing is something like an audit trail: an effective way to track the logic from instruction to visual execution.

Inverse Dynamics as a Solution?

Interestingly, there's an asymmetry in how these models handle multimodal tasks. Fine-tuning a VLM for inverse dynamics prediction (IDP), essentially captioning the action between frames, proves significantly easier than fine-tuning it for FDP. This easier task can bootstrap the more complex FDP through two strategies: weakly supervised learning from synthetic data and inference-time verification.
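To make the weak-supervision strategy concrete, here is a minimal sketch of how an IDP model could turn unlabeled video frames into FDP training triples. It is an illustration, not the method evaluated in the article: the names `FDPExample`, `pseudo_label_pairs`, and `toy_idp` are hypothetical, frames are toy tuples of numbers rather than images, and the confidence filter is an added assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]  # toy stand-in for an image (assumption)


@dataclass(frozen=True)
class FDPExample:
    """One forward-dynamics triple: (source frame, action text) -> target frame."""
    source: Frame
    action: str
    target: Frame


def pseudo_label_pairs(
    frames: Sequence[Frame],
    idp_model: Callable[[Frame, Frame], Tuple[str, float]],
    min_confidence: float = 0.5,
) -> List[FDPExample]:
    """Caption the action between consecutive frames and keep confident labels.

    `idp_model` returns (action text, confidence). The confidence threshold
    is an assumption of this sketch, not something the article specifies.
    """
    examples = []
    for before, after in zip(frames, frames[1:]):
        action, confidence = idp_model(before, after)
        if confidence >= min_confidence:
            examples.append(FDPExample(before, action, after))
    return examples


def toy_idp(before: Frame, after: Frame) -> Tuple[str, float]:
    """Hypothetical IDP stand-in: describes the change in the first value."""
    delta = after[0] - before[0]
    if delta == 0:
        return "do nothing", 0.1  # low confidence: no visible change
    direction = "right" if delta > 0 else "left"
    return f"move {direction}", 0.9


video = [(0.0,), (1.0,), (1.0,), (0.0,)]
training_set = pseudo_label_pairs(video, toy_idp)
```

In a real pipeline, `toy_idp` would be replaced by the fine-tuned IDP model, and the resulting triples would be added to the FDP fine-tuning data.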

In image editing, these models could be groundbreaking. For weak supervision, IDP expands the training data for FDP by annotating actions in unlabeled video frames. At inference time, IDP assigns scores to various FDP samples, effectively guiding the model's search for the right prediction.
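The inference-time step can be pictured as a best-of-N rerank: sample several candidate next frames from the FDP model, score each with IDP against the instruction, and keep the highest-scoring one. The sketch below assumes that simple selection rule. `rerank_with_idp` and `toy_idp_score` are hypothetical names, and frames are again toy tuples rather than images.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]  # toy stand-in for an image (assumption)


def rerank_with_idp(
    source: Frame,
    instruction: str,
    candidates: Sequence[Frame],
    idp_score: Callable[[Frame, Frame, str], float],
) -> Tuple[Frame, List[float]]:
    """Return the candidate whose (source -> candidate) change best matches
    the instruction according to the IDP scorer, plus all scores."""
    if not candidates:
        raise ValueError("need at least one candidate frame")
    scores = [idp_score(source, c, instruction) for c in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores


def toy_idp_score(source: Frame, candidate: Frame, instruction: str) -> float:
    """Hypothetical scorer: rewards a change in the instructed direction."""
    delta = candidate[0] - source[0]
    wanted = 1.0 if instruction == "move right" else -1.0
    return delta * wanted


# Three hypothetical FDP samples for the same source frame and instruction.
samples = [(-1.0,), (0.0,), (2.0,)]
best, scores = rerank_with_idp((0.0,), "move right", samples, toy_idp_score)
```

A real scorer might use the likelihood the IDP model assigns to the instruction given the two frames. That choice is an assumption here, since the article does not say how the scores are computed.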

Promising Results, But Room for Growth

Evaluations on Aurora-Bench, a benchmark for action-centric image editing, reveal promising results. Two families of VLMs were tested, and their performance, while not surpassing state-of-the-art image editing models, is competitive. In fact, the top-performing model improved image editing accuracy by 7% to 13% according to GPT4o-as-judge, and it achieved the best average human evaluations across all subsets.

So why should this matter to you? With AI's growing role in automating complex tasks, understanding its limitations and potential is key. These findings suggest that while VLMs have a long way to go in some areas, their ability to learn in reverse can be harnessed to move forward. The question remains: how can we further refine these models to bridge the gap between understanding and execution?
