Are Multimodal Models Missing the Creative Mark?

We live in an age where large multimodal models (LMMs) are rapidly advancing, deftly navigating perception and reasoning. But are they truly evolving beyond pattern recognition to embrace creativity in open-ended environments? They may be missing something important.

Introducing MM-CreativityBench

Enter MM-CreativityBench, a new benchmark designed to push these models beyond their comfort zone. Here, intelligence isn't just about answering questions. It's about identifying how scene elements can be repurposed in unexpected yet feasible ways. This kind of creative problem-solving mirrors human intelligence but remains largely untested in existing benchmarks.

MM-CreativityBench presents scenarios with structured views of items and their parts, challenging models to inspect, identify affordances, and devise grounded solutions. But current LMMs are stumbling. Not because they can't generate ideas, but because they struggle with grounded exploration. They often overlook key entities, miss critical parts, or hallucinate attributes not present in the images.
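To make those failure modes concrete, here is a minimal sketch of how a grounding check might look. The scene layout (entities mapping to parts mapping to attributes), the function name, and the example data are all assumptions made for illustration; the benchmark's actual data format and scoring are not described here.

```python
def grounding_report(scene, references):
    """Check a model's cited (entity, part, attribute) triples against a scene.

    scene: {entity: {part: set_of_attributes}}  (hypothetical layout)
    references: list of (entity, part, attribute) triples cited by the model
    """
    ungrounded = []
    for entity, part, attribute in references:
        parts = scene.get(entity)
        if parts is None:
            ungrounded.append((entity, part, attribute, "unknown entity"))
        elif part not in parts:
            ungrounded.append((entity, part, attribute, "unknown part"))
        elif attribute not in parts[part]:
            ungrounded.append((entity, part, attribute, "unsupported attribute"))

    # Entities present in the scene that the model never referred to.
    inspected = {entity for entity, _, _ in references if entity in scene}
    overlooked = sorted(set(scene) - inspected)
    return {"ungrounded": ungrounded, "overlooked_entities": overlooked}


# Hypothetical scene and model output, for illustration only.
scene = {
    "umbrella": {"handle": {"curved", "rigid"}, "canopy": {"waterproof"}},
    "shoe": {"lace": {"flexible", "long"}},
}
references = [
    ("umbrella", "handle", "curved"),    # grounded
    ("umbrella", "handle", "magnetic"),  # attribute not in the scene
]
report = grounding_report(scene, references)
```

In this example the report flags the "magnetic" handle as an unsupported attribute and lists the shoe as an overlooked entity, which are two of the error types described above.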

The Affordance-Grounded Solution

So, what's the fix? Affordance-grounded alignment. This approach reframes creative tool use as a preference learning problem. It emphasizes reasoning tied to visual evidence rather than imagined attributes. Using Direct Preference Optimization, models are nudged to favor real affordance reasoning.
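For readers unfamiliar with Direct Preference Optimization, the sketch below shows the standard per-pair DPO loss in plain Python. It assumes each input is the summed log-probability of a whole response under the trained policy or a frozen reference model. The function name, the argument names, and the beta value of 0.1 are illustrative assumptions, not details taken from the work discussed here. In this setting, the "chosen" response would be one whose reasoning is tied to visible evidence, and the "rejected" one would rely on imagined attributes.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    where each term is a sequence-level log-probability.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(x)) computed in a numerically stable way.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: loss equals log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy has moved toward the grounded (chosen) response: loss drops.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy raises the likelihood of the grounded response relative to the reference model and lowers that of the ungrounded one, which is the "nudge" described above.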

There's also a layer of supervision from an affordance knowledge base. This guides models to explore entities widely and engage in multi-turn planning. The results? Noticeable improvements. Models are getting better at selecting the right entities and parts while cutting down on hallucination and grounding errors.

Why This Matters

Here's the question: if these models can't creatively interact with visual environments, are they really the future we envision? For all the talk about AI's potential, the gap between perception and genuine creative application is glaring.

As AI continues to spread into various sectors, from gaming to industry applications, the demand for models that understand and creatively respond to visual cues will only grow. The key takeaway? If LMMs are to live up to their hype, they need to genuinely 'see' and not just 'look'.
