AI models that can understand and generate multiple types of data — text, images, audio, video.
GPT-4V, Gemini, and Claude 3 are multimodal models that can process both text and images. The trend is toward models that handle all modalities natively rather than through separate systems.
Multimodal AI systems can process and generate multiple types of data — text, images, audio, video — rather than being limited to a single modality. GPT-4V can look at images and answer questions about them. Gemini can process video. Claude can analyze charts and documents. These are all multimodal capabilities.
The shift toward multimodal is significant because the real world isn't text-only. A doctor needs AI that can look at X-rays and read patient notes. A developer wants AI that can see their UI mockup and write the code. A researcher needs AI that can read charts, tables, and equations alongside text. Limiting AI to just text means missing most of the information humans work with daily.
Building multimodal models is technically challenging because different modalities require different processing approaches. An image is a grid of pixels, text is a sequence of tokens, and audio is a waveform. The model needs to align these different representations in a shared representation space. Current approaches use modality-specific encoders that feed into a shared transformer backbone. The frontier is moving toward models that natively think in multiple modalities rather than just bridging between them.
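To make the "modality-specific encoders feeding a shared backbone" idea concrete, here is a minimal sketch in plain Python. It is not how any particular model is implemented. It only shows the shape of the idea: an image is cut into patches and each patch is projected to a vector, text token ids are looked up in an embedding table, and both end up as vectors of the same width that can be concatenated into one sequence. All names (`encode_image`, `encode_text`, `D_MODEL`, the patch size, the vocabulary size) are hypothetical, and the random weights stand in for parameters a real model would learn.

```python
import random

D_MODEL = 8  # shared embedding width (illustrative value)

def make_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def project(vec, weights):
    """Multiply a vector of length `rows` by a rows x cols weight matrix."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*weights)]

def encode_image(image, patch, weights):
    """Split an H x W grayscale grid into patch x patch tiles and project each tile.

    Assumes H and W are multiples of `patch`.
    """
    tokens = []
    for top in range(0, len(image), patch):
        for left in range(0, len(image[0]), patch):
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            tokens.append(project(flat, weights))
    return tokens

def encode_text(token_ids, table):
    """Look up one embedding row per token id."""
    return [table[i] for i in token_ids]

rng = random.Random(0)
PATCH = 2
VOCAB = 10
image_weights = make_matrix(PATCH * PATCH, D_MODEL, rng)
text_table = make_matrix(VOCAB, D_MODEL, rng)

# A 4x4 toy "image" and three toy token ids.
image = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]
text_ids = [3, 1, 4]

image_tokens = encode_image(image, PATCH, image_weights)
text_tokens = encode_text(text_ids, text_table)

# Both modalities now share one width, so a single backbone could consume them.
sequence = image_tokens + text_tokens
print(len(image_tokens), len(text_tokens), len(sequence), len(sequence[0]))
```

In this toy setup the 4x4 image yields four patch vectors and the text yields three token vectors, giving a combined sequence of seven vectors, each of width `D_MODEL`. A real system would pass that sequence through transformer layers and train the encoders so that related images and text land near each other.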
"We switched to a multimodal model so our document processing pipeline can handle scanned PDFs with images, tables, and handwritten notes — not just typed text."
CLIP: Contrastive Language-Image Pre-training.
Generative AI: AI systems that create new content (text, images, audio, video, or code) rather than just analyzing or classifying existing data.
Activation function: A mathematical function applied to a neuron's output that introduces non-linearity into the network.
Adam: An optimization algorithm that combines the best parts of two other methods, AdaGrad and RMSProp.
AGI: Artificial General Intelligence.
AI agent: An autonomous AI system that can perceive its environment, make decisions, and take actions to achieve goals.