Vision-Language Models Struggle with Predicting the Future

Unified vision-language models (VLMs) are impressive, no doubt about it. But when asked to predict a future state from a current image and a verbal instruction, they stumble. The question is, why do these models struggle to forecast what happens next in an image sequence?

The Struggle with Forward Dynamics

The problem lies in a task called forward dynamics prediction (FDP): predicting the next image in a sequence given the previous one and a specified action. Currently, VLMs struggle to generate physically plausible transitions. Imagine asking a painter to paint a scene's future from only a verbal cue. It's not easy!

Yet, there's a silver lining. VLMs are pretty sharp at inverse dynamics prediction (IDP), which means captioning the action that takes one frame to the next. Essentially, they find it much easier to describe what changed between two frames than to generate the second frame themselves. Why does this matter? Because IDP can act as a springboard to enhance FDP.
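
To make the two tasks concrete, here is a minimal toy sketch of their inputs and outputs. It is not the researchers' model: a "frame" here is just an integer position on a line rather than an image, and the names forward_dynamics, inverse_dynamics, and MOVES are hypothetical stand-ins for what a VLM would do with real frames.

```python
# Toy stand-ins (assumptions): a "frame" is an integer position,
# an "action" is a short piece of text.
MOVES = {"move left": -1, "stay": 0, "move right": 1}


def forward_dynamics(frame: int, action: str) -> int:
    """FDP: given the current frame and an action, predict the next frame."""
    return frame + MOVES[action]


def inverse_dynamics(before: int, after: int) -> str:
    """IDP: given two frames, name the action that links them."""
    delta = after - before
    for action, step in MOVES.items():
        if step == delta:
            return action
    return "unknown"


next_frame = forward_dynamics(3, "move right")
caption = inverse_dynamics(3, next_frame)
```

In this toy world both directions are trivial. With real images, the forward direction has to generate an entire picture, while the inverse direction only has to produce a short caption, which is one way to see why the second could be the easier task.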

Bootstrapping the Future with IDP

Here's where it gets clever. By fine-tuning a VLM on IDP, researchers found two strategies that use it to nudge the model toward better FDP.

First, IDP can annotate actions in unlabeled video frames, creating more training data for FDP. Second, IDP can be used to evaluate multiple predictions during inference, essentially scoring them to determine which are most accurate.
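
The two strategies can be sketched roughly as follows. This is an illustration under stated assumptions, not the researchers' code: frames are integers instead of images, and pseudo_label, best_of_n, toy_caption, and toy_score are hypothetical names. In particular, how a real IDP model turns its judgment into a score is not described here, so the match-or-no-match scoring below is purely an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Frame = int  # toy stand-in for an image (assumption)


def pseudo_label(
    frames: Sequence[Frame],
    idp_caption: Callable[[Frame, Frame], str],
) -> List[Tuple[Frame, str, Frame]]:
    """Strategy 1: caption each pair of consecutive unlabeled frames,
    yielding (frame, action, next_frame) triples usable as FDP training data."""
    return [(a, idp_caption(a, b), b) for a, b in zip(frames, frames[1:])]


def best_of_n(
    frame: Frame,
    action: str,
    candidates: Sequence[Frame],
    idp_score: Callable[[Frame, Frame, str], float],
) -> Frame:
    """Strategy 2: among several predicted next frames, keep the one the
    IDP scorer rates as best matching the requested action."""
    return max(candidates, key=lambda c: idp_score(frame, c, action))


def toy_caption(before: Frame, after: Frame) -> str:
    """Toy IDP model: names the action from the direction of change."""
    if after > before:
        return "move right"
    if after < before:
        return "move left"
    return "stay"


def toy_score(before: Frame, after: Frame, action: str) -> float:
    """Toy scorer (assumption): 1.0 if the caption matches the action, else 0.0."""
    return 1.0 if toy_caption(before, after) == action else 0.0


triples = pseudo_label([0, 1, 1, 0], toy_caption)
chosen = best_of_n(2, "move left", [3, 2, 1], toy_score)
```

The design point is that both strategies treat the IDP model as a black box: one uses its captions to label data before training, the other uses its scores to pick among candidates at inference time.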

A Competitive Edge in Image Editing

So, how do these strategies pan out in the real world? When tested on Aurora-Bench, a benchmark for action-centric image editing, VLMs that employed these IDP strategies delivered impressive results. They beat state-of-the-art image editing models by 7% to 13% according to GPT-4o's judgment. Not only that, but they also earned the highest average scores in human evaluations across all subsets of Aurora-Bench.

In the end, the lesson is clear: a skill a model already has can be used to bootstrap one it lacks. While the current models aren't quite there with future predictions, their prowess at describing the action between frames offers a roadmap to improvement.

So, the question remains: How long until these models catch up with our wildest visual predictions? Let's hope it's sooner rather than later.
