What It Is
Computers see images as grids of numbers — each pixel has red, green, and blue values. Computer vision is the set of techniques that turn those raw numbers into understanding. "There's a cat sitting on a couch" isn't obvious to a computer. It has to learn what cats, couches, and "sitting on" mean from millions of examples.
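To make the "grid of numbers" idea concrete, here is a minimal sketch in plain Python (no imaging library): a tiny image as nested lists of RGB triples, plus the standard luma weights for converting color to brightness. The 2×2 image and its pixel values are invented for illustration.

```python
# A tiny 2x2 "image" as nested lists of (R, G, B) values in 0-255.
# Real images are just much larger grids of the same kind of numbers.
image = [
    [(255, 0, 0), (0, 255, 0)],   # red pixel, green pixel
    [(0, 0, 255), (30, 30, 30)],  # blue pixel, dark gray pixel
]

height = len(image)   # 2 rows
width = len(image[0]) # 2 columns

def brightness(pixel):
    """Grayscale value of one pixel, using the standard luma weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b

# Convert the whole image to a grayscale grid.
gray = [[brightness(p) for p in row] for row in image]
```

Everything a vision model does, from edge detection to scene understanding, starts from grids like `gray`: the raw numbers carry no labels, so all meaning has to be learned.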
The field covers everything from basic tasks like detecting edges to complex ones like understanding what's happening in a video, generating photorealistic images, or reconstructing 3D scenes from 2D photos.
Why It Matters
Vision is the primary way humans understand the world, and it's becoming the same for AI. Computer vision is behind face unlock on your phone, visual search in Google Photos, quality control in manufacturing, medical image analysis, autonomous driving, and augmented reality.
It's also one of the most mature areas of AI. While language models grabbed headlines recently, computer vision has been delivering real value in production for over a decade. Factories use it to spot defects. Hospitals use it to flag potential tumors. Agriculture uses it to monitor crop health from drones.
How It Works
Modern computer vision is built on deep learning, specifically convolutional neural networks (CNNs) and increasingly vision transformers.
CNNs work by sliding small filters across the image. Each filter detects a specific pattern — horizontal edges, vertical edges, corners, textures. Early layers detect simple patterns. Deeper layers combine them into complex features like eyes, wheels, or text. By the final layers, the network can classify entire objects.
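The "sliding filter" idea can be sketched in a few lines of plain Python (no deep-learning library): a 3×3 kernel is moved across a grayscale image, and at each position the kernel and the image patch underneath it are multiplied element-wise and summed. The Sobel-style vertical-edge kernel and the toy 4×4 image are illustrative choices; in a real CNN the kernel values are learned, not hand-written.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image`; each output value is the
    element-wise product of the kernel and one image patch, summed."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            row.append(sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            ))
        out.append(row)
    return out

# A vertical-edge detector: responds where brightness changes
# from left to right (a Sobel-style kernel).
vertical_edge = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

# A 4x4 image: dark on the left half, bright on the right half.
img = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

edges = convolve2d(img, vertical_edge)
# Every 3x3 window spans the dark-to-bright boundary,
# so every response is large: [[36, 36], [36, 36]]
```

A CNN stacks many such filters in many layers, so later filters operate on the pattern maps produced by earlier ones rather than on raw pixels.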
Vision transformers (ViTs) split images into patches and process them like tokens in a language model, using attention mechanisms to understand how different parts of the image relate to each other. They've matched or beaten CNNs on many benchmarks.
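The patch-splitting step can be sketched as follows, again in plain Python with an invented 4×4 example image. A real ViT would additionally project each flattened patch with a learned linear layer and add position embeddings before the attention layers see it; this sketch only shows the tokenization.

```python
def image_to_patches(image, patch_size):
    """Split a grayscale grid into non-overlapping square patches,
    flattening each patch into one vector (one 'token')."""
    h, w = len(image), len(image[0])
    patches = []
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            patches.append([
                image[i + di][j + dj]
                for di in range(patch_size)
                for dj in range(patch_size)
            ])
    return patches

# A 4x4 image split into 2x2 patches -> a sequence of 4 tokens,
# each of length 4, analogous to word tokens in a language model.
img = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
tokens = image_to_patches(img, 2)
# tokens[0] is the top-left patch: [1, 2, 5, 6]
```

At typical settings (224×224 image, 16×16 patches) this yields 196 tokens, a sequence short enough for standard attention to relate every patch to every other.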
The key tasks in computer vision include:
Image classification: "This is a cat." The model assigns the whole image to a category.
Object detection: "There's a cat at coordinates (120, 340) and a dog at (500, 200)." Finds objects and draws bounding boxes around them. YOLO (You Only Look Once) is the most famous real-time detection model.
Semantic segmentation: Labels every pixel in the image. "These pixels are road, these are sidewalk, these are car." Critical for autonomous driving.
Image generation: Creating new images from text descriptions or other inputs. Diffusion models (Stable Diffusion, DALL-E, Midjourney) have revolutionized this area.
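For object detection in particular, the bounding boxes above are usually compared with intersection-over-union (IoU), the standard overlap metric used both in evaluation and in non-maximum suppression (discarding duplicate detections of the same object). A minimal sketch, with boxes as `(x1, y1, x2, y2)` corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle (may be empty).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two 10x10 boxes that half-overlap: intersection 50, union 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```

An IoU of 1.0 means identical boxes and 0.0 means no overlap; detection benchmarks typically count a prediction as correct when its IoU with the ground-truth box exceeds a threshold such as 0.5.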
Key Examples
Medical imaging: AI models detect cancer in mammograms, retinal diseases in eye scans, and fractures in X-rays. Some systems outperform individual radiologists, though they work best alongside clinicians rather than replacing them.
Autonomous vehicles: Tesla, Waymo, and Cruise use computer vision to understand driving environments in real time. Tesla's approach uses cameras only, while others combine cameras with lidar.
Content moderation: Social media platforms use CV to automatically detect and remove violent or explicit images at scale.
Retail: Amazon Go stores use computer vision to track what customers pick up, enabling checkout-free shopping.
Where to Go Next
- → Diffusion Models — how AI generates images
- → Multimodal AI — combining vision with language
- → Deep Learning — the technology behind CV
- → Neural Networks — the building blocks