The Rise of Multimodal Models: A Story of Late Bloomers
Multimodal models are finally gaining traction in open LLM families, but the journey's been anything but straightforward.
In the world of large language models, the rise of multimodal capabilities has been slow but steady. As we dive into the data, one thing is clear: the real story is how these capabilities are catching on, and, spoiler alert, it's not in the way most would expect.
The Late Bloomers
While many tech enthusiasts were quick to herald the arrival of multimodal models, the reality has been more about steady groundwork than explosive debut. In fact, even as these models were making headlines, true adoption within major open LLM families lagged behind. Through the end of 2023 and into early 2024, multimodal capabilities were still rare; the sharp increase came only in 2024-2025, driven primarily by image-text vision-language tasks.
Take the Gemma family, for example. Its first vision-language variant appeared about a month after the initial text-generation releases. Other families took a year or more to catch up, and GLM was notably slow, lagging by a staggering 26 months. Those numbers tell a story of cautious evolution rather than rushed innovation.
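The lags quoted above reduce to a simple date calculation: months between a family's first text-generation release and its first vision-language variant. Here is a minimal sketch; the dates are placeholders chosen purely to reproduce the lags mentioned in this article, not verified release dates.

```python
from datetime import date

# Hypothetical first-release dates per family. These are illustrative
# placeholders, NOT the actual dataset behind the figures above.
first_release = {
    "Gemma": {"text": date(2024, 2, 1), "vlm": date(2024, 3, 1)},
    "GLM":   {"text": date(2022, 8, 1), "vlm": date(2024, 10, 1)},
}

def lag_in_months(family: str) -> int:
    """Whole months between a family's first text model and first VLM."""
    t = first_release[family]["text"]
    v = first_release[family]["vlm"]
    return (v.year - t.year) * 12 + (v.month - t.month)

for name in first_release:
    print(f"{name}: {lag_in_months(name)} months")
# With the placeholder dates above: Gemma lags 1 month, GLM lags 26.
```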
Inside the Model Families
If you're picturing these multimodal models springing from text-generation roots, think again. The data shows a different narrative. A mere 0.218% of fine-tunes of text-generation models produced vision-language model (VLM) descendants. Instead, 94.5% of VLM offshoots came from existing VLMs, highlighting a strong preference for iterating within the same lineage.
Most new VLM releases appear as fresh starts rather than offshoots of existing models: around 60% of them emerge without any recorded parent. What does this mean for the industry? It suggests that innovation in this area comes in spontaneous bursts rather than as a gradual build-up.
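The two statistics above are shares of VLMs grouped by parent type (text model, existing VLM, or no recorded parent). A minimal sketch of that tally, using a made-up lineage table in place of the real model-card metadata:

```python
from collections import Counter

# Hypothetical lineage records: (model_name, parent_type), where
# parent_type is "text", "vlm", or None for no recorded parent.
# Purely illustrative; the real analysis would parse model metadata.
vlm_lineage = [
    ("vlm-a", "vlm"), ("vlm-b", "vlm"), ("vlm-c", None),
    ("vlm-d", None),  ("vlm-e", "text"), ("vlm-f", "vlm"),
]

def parent_shares(records):
    """Fraction of VLMs by parent type: text model, VLM, or none."""
    counts = Counter(parent for _, parent in records)
    total = len(records)
    return {kind: counts[kind] / total for kind in ("text", "vlm", None)}

print(parent_shares(vlm_lineage))
```

On the real data, the same tally would yield the figures cited above: a tiny "text" share, a dominant "vlm" share among recorded parents, and roughly 60% with no parent at all.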
The Road Ahead
So, why should any of this matter to anyone outside the AI bubble? Well, the slow uptake and unique growth patterns of multimodal models could have broader implications for how companies plan their AI strategies. When the gaps between keynote promises and cubicle realities are this wide, how can organizations ensure they're not left behind?
The fragmented adoption and specific lineage evolution indicate that investing in AI requires more than just buying licenses. It's about understanding the intricate dynamics of these tools and having a clear strategy for integration. Otherwise, organizations risk falling into the trap of shiny new tech that never quite fits their needs.
The press release might talk about the wonders of AI transformation, but on the ground, the story is still unfolding. And as companies navigate these uncharted waters, they'll need to ask themselves: Are we ready to adapt, or are we just along for the ride?
Key Terms Explained
Fine-tuning: The process of taking a pre-trained model and continuing to train it on a smaller, specific dataset to adapt it for a particular task or domain.
LLM (Large Language Model): An AI model that understands and generates human language.
Multimodal models: AI models that can understand and generate multiple types of data — text, images, audio, video.