Are Multimodal Models Missing the Creative Mark?
Multimodal models are advancing fast, but do they really understand visual creativity? A new benchmark highlights their struggles, and its authors propose a fix.
Large multimodal models (LMMs) keep getting better at perception and reasoning. But are they truly evolving beyond pattern recognition to embrace creativity in open-ended environments? The evidence so far suggests they're missing something important.
Introducing MM-CreativityBench
Enter MM-CreativityBench, a new benchmark designed to push these models beyond their comfort zone. Here, intelligence isn't just about answering questions. It's about identifying how scene elements can be repurposed in unexpected yet feasible ways. This kind of creative problem-solving mirrors human intelligence but remains largely untested in existing benchmarks.
MM-CreativityBench presents scenarios with structured views of items and their parts, challenging models to inspect, identify affordances, and devise grounded solutions. But current LMMs are stumbling. Not because they can't generate ideas, but because they struggle with grounded exploration. They often overlook key entities, miss critical parts, or hallucinate attributes not present in the images.
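To make those failure modes concrete, here is a minimal sketch of how a proposed solution could be checked against a structured scene. Everything in it is an assumption for illustration: the `Entity` type, the `(entity, part, attribute)` claim format, and the error labels are hypothetical and are not taken from MM-CreativityBench itself.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical scene annotation: one item and its visible parts."""
    name: str
    parts: dict[str, set[str]]  # part name -> attributes visible in the image


def grounding_errors(scene, claims):
    """Return claims that are not supported by the scene annotation.

    `claims` is a list of (entity, part, attribute) triples that a model
    relied on in its solution (an assumed format, not the benchmark's).
    """
    known = {entity.name: entity for entity in scene}
    errors = []
    for entity, part, attribute in claims:
        if entity not in known:
            errors.append(("unknown_entity", entity, part, attribute))
        elif part not in known[entity].parts:
            errors.append(("unknown_part", entity, part, attribute))
        elif attribute not in known[entity].parts[part]:
            errors.append(("hallucinated_attribute", entity, part, attribute))
    return errors


def overlooked_entities(scene, claims):
    """Return names of scene entities the model never referenced."""
    used = {entity for entity, _, _ in claims}
    return [entity.name for entity in scene if entity.name not in used]


scene = [
    Entity("mug", {"handle": {"rigid", "curved"}}),
    Entity("ruler", {"edge": {"straight", "thin"}}),
]
claims = [
    ("mug", "handle", "curved"),    # supported by the scene
    ("mug", "handle", "magnetic"),  # attribute not present in the scene
    ("spoon", "bowl", "concave"),   # entity not present in the scene
]
print(grounding_errors(scene, claims))
print(overlooked_entities(scene, claims))
```

A real evaluation would need to map free-form model text onto such triples first, which is the hard part this sketch skips.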
The Affordance-Grounded Solution
So, what's the fix? Affordance-grounded alignment. This approach reframes creative tool use as a preference learning problem. It emphasizes reasoning tied to visual evidence rather than imagined attributes. Using Direct Preference Optimization, models are nudged to favor real affordance reasoning.
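For readers unfamiliar with Direct Preference Optimization, the sketch below shows the standard DPO loss for a single preference pair. It is not the authors' training code. The log-probability inputs, the `beta` value, and the idea that the "chosen" response is the visually grounded one and the "rejected" response leans on an imagined attribute are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    `beta` controls how strongly the policy may drift from the reference.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


# Policy identical to the reference: no preference learned yet.
baseline = dpo_loss(-3.0, -3.0, -3.0, -3.0)

# Policy now favors the grounded (chosen) response over the rejected one.
improved = dpo_loss(-1.0, -5.0, -3.0, -3.0)

print(baseline, improved)
```

The loss falls as the policy raises the likelihood of the grounded response relative to the reference model and lowers that of the ungrounded one, which is the "nudge" described above.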
There's also a layer of supervision from an affordance knowledge base. This guides models to explore entities widely and engage in multi-turn planning. The results? Noticeable improvements. Models are getting better at selecting the right entities and parts while cutting down on hallucination and grounding errors.
Why This Matters
Here's the question: if these models can't creatively interact with visual environments, are they really the future we envision? For all the talk about AI's potential, the gap between perception and genuine creative application is glaring.
As AI continues to spread across sectors, from gaming to industry applications, the demand for models that understand and creatively respond to visual cues will only grow. The key takeaway? If LMMs are to live up to their hype, they need to genuinely 'see' and not just 'look'.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Grounding: Connecting an AI model's outputs to verified, factual information sources.
Hallucination: When an AI model generates confident-sounding but factually incorrect or completely fabricated information.
Multimodal models: AI models that can understand and generate multiple types of data, such as text, images, audio, and video.