What It Is
A "modality" is a type of data: text, images, audio, video, code, 3D models. Traditional AI models handle one modality. A language model processes text. An image classifier processes images. Multimodal AI handles multiple modalities at once.
When you upload a photo to Claude or GPT-4 and ask "What's happening in this image?", that's multimodal AI. The model processes the image and generates a text response. When you ask Gemini to analyze a video, or when Sora generates video from a text description, that's also multimodal.
The key insight is that the real world is inherently multimodal. Humans don't process sight, sound, and language in isolation — we combine them. AI is finally catching up.
Why It Matters
Text-only AI has a major blind spot: most of the world's information isn't text. Charts, photographs, diagrams, videos, audio recordings, handwritten notes — a text-only model can't touch any of it. Multimodal models can.
This opens up applications that were impossible before: analyzing medical images while reading patient records, understanding financial charts alongside earnings reports, processing meeting recordings with slides and transcripts together. The model gets the full picture, not just fragments.
How It Works
There are several approaches to making models multimodal:
Early fusion: Convert all modalities into the same representation space from the start. Images get split into patches and encoded as tokens, just like words. The transformer processes everything in a unified sequence. Google's Gemini takes this approach — it's natively multimodal.
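To make the patch-as-token idea concrete, here is a minimal NumPy sketch of patch embedding. The patch size (16), image size (224), and model dimension (512) are illustrative choices, not the values any particular model uses, and the projection weights are random stand-ins for learned parameters:

```python
import numpy as np

def patchify(image, patch=16):
    """Split an H x W x C image into flattened patch vectors, one per 'token'."""
    H, W, C = image.shape
    patches = image.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return patches

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))            # dummy 224x224 RGB image
tokens = patchify(image)                     # 196 patches, each 16*16*3 = 768 values
W_embed = rng.standard_normal((768, 512)) * 0.02
image_tokens = tokens @ W_embed              # project into the model's embedding dim
print(image_tokens.shape)                    # (196, 512)
```

After this projection, the 196 image tokens can be interleaved with text tokens in one sequence and fed to a single transformer, which is the essence of early fusion.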
Adapter approach: Take a pre-trained language model and attach a vision encoder. The image encoder (often a CLIP or SigLIP model) converts images into embeddings, which a small trained projection maps into the space the language model processes alongside text. GPT-4V and Claude's vision capabilities are widely believed to use variations of this approach, though the implementation details aren't public.
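A rough sketch of the adapter idea, with a random stand-in for the frozen vision encoder and made-up dimensions (49 visual features of size 1024, a 4096-dim language model) chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def vision_encoder(image):
    """Stand-in for a frozen pre-trained vision encoder (e.g. CLIP)."""
    return rng.standard_normal((49, 1024)) * 0.02   # 49 visual features, dim 1024

# Small trainable projection ("adapter") from vision space to the LM embedding space.
W_proj = rng.standard_normal((1024, 4096)) * 0.02

def embed_image_for_lm(image):
    visual = vision_encoder(image)
    return visual @ W_proj                          # (49, 4096) pseudo image tokens

text_embeds = rng.standard_normal((12, 4096))       # embeddings for 12 text tokens
image_embeds = embed_image_for_lm(None)
sequence = np.concatenate([image_embeds, text_embeds])
print(sequence.shape)                               # (61, 4096)
```

The language model then processes this mixed sequence exactly as if every position were a text token; only the projection (and sometimes the encoder) needs training.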
Cross-modal training: Models like CLIP were trained on image-text pairs, learning to associate images with their descriptions. This creates a shared embedding space where images and text are directly comparable.
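The shared embedding space is trained with a symmetric contrastive loss: in a batch of matched (image, text) pairs, each image embedding should be closest to its own caption's embedding. A minimal NumPy sketch of that loss, using synthetic embeddings rather than real encoder outputs (the temperature 0.07 matches the value reported in the CLIP paper):

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matching pairs sit on the diagonal of the logits."""
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature
    labels = np.arange(len(logits))

    def xent(l):  # cross-entropy with the diagonal as the correct class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(2)
shared = rng.standard_normal((4, 64))                  # 4 "concepts" in shared space
img_emb = shared + 0.1 * rng.standard_normal((4, 64))  # image views of each concept
txt_emb = shared + 0.1 * rng.standard_normal((4, 64))  # text views of each concept
print(clip_contrastive_loss(img_emb, txt_emb))         # small: pairs already aligned
```

Once trained, the same normalized dot product powers zero-shot tasks: to classify an image, compare its embedding against the embeddings of candidate text labels and pick the closest.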
Key Examples
GPT-4o: Processes text, images, and audio in a single model. Can have spoken conversations, analyze images, and generate speech natively.
Gemini: Google's natively multimodal model handles text, images, audio, video, and code. Can analyze long videos and answer questions about them.
Claude (Anthropic): Processes images and text, with strong document analysis capabilities — reading charts, screenshots, handwriting, and PDFs.
CLIP (OpenAI): The 2021 model that kicked off the modern wave of multimodal AI. It learns to connect images with text descriptions, powering image search and serving as the text encoder for diffusion models.
Where to Go Next
- → Computer Vision — the vision side of multimodal
- → Embeddings — how different modalities share a space
- → Large Language Models — the text backbone
- → Diffusion Models — generating images from text