Artificially generated data used for training AI models. Can be created by other AI models, simulations, or procedural generation. Useful when real data is scarce, private, or biased. Increasingly used to train and evaluate models, but risks introducing its own biases and distribution issues.
Synthetic data is artificially generated data used to train or evaluate AI models, rather than data collected from real-world sources. AI models can generate training data for other models — or even for improved versions of themselves. It's one of the most important and controversial techniques in modern AI development.
The use cases are broad. Can't get enough labeled medical images? Generate synthetic ones. Need diverse conversational data? Have an LLM generate thousands of sample dialogues. Want to test edge cases? Synthesize scenarios that rarely occur naturally. Companies like Scale AI and Anthropic use synthetic data extensively. The Phi models from Microsoft showed that small models trained on high-quality synthetic data can perform well above what their size would suggest.
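As a rough illustration of the simplest route, procedural generation, here is a minimal sketch that builds labeled support tickets from templates. The templates, actions, error codes, and the `generate_tickets` helper are all invented for this example; a real pipeline would more likely use an LLM or a simulator to get richer variety.

```python
import random

# Hypothetical templates, actions, and error codes, invented for illustration.
TEMPLATES = [
    "I keep getting error {code} when I try to {action}.",
    "After the latest update, trying to {action} fails with {code}.",
    "Error {code} appears every time I {action}. Please help.",
]
ACTIONS = ["log in", "export a report", "sync my files", "update billing"]
RARE_CODES = ["E4012", "E7730", "E9001"]


def generate_tickets(n, seed=0):
    """Return n synthetic (text, label) pairs; the label is the error code."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    tickets = []
    for _ in range(n):
        code = rng.choice(RARE_CODES)
        template = rng.choice(TEMPLATES)
        text = template.format(code=code, action=rng.choice(ACTIONS))
        tickets.append((text, code))
    return tickets


if __name__ == "__main__":
    for text, label in generate_tickets(3):
        print(label, "|", text)
```

Template-based data like this is cheap and perfectly labeled, but it only covers the variation written into the templates, which is one way synthetic data can end up less diverse than real data.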
The controversy centers on data quality and "model collapse." When each generation of models is trained largely on output from the previous generation, errors can compound and diversity can shrink, like making a photocopy of a photocopy. The usual mitigation is careful curation: mix synthetic data with real data, filter for quality, and verify accuracy. Done right, synthetic data extends what's possible. Done carelessly, it degrades model quality in subtle ways.
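The curation steps can be sketched in a few lines. This is a minimal illustration, not a standard recipe: the `curate` function, its parameters, and the default thresholds are assumptions made for this example, and `quality_fn` stands in for whatever scoring you trust (a classifier, a rule set, or human review).

```python
import random


def curate(real, synthetic, quality_fn, min_quality=0.5,
           max_synth_fraction=0.5, seed=0):
    """Filter synthetic examples, drop exact duplicates, and cap their share.

    Examples must be hashable (for instance, strings). All thresholds here
    are illustrative defaults, not recommended values.
    """
    if not 0 <= max_synth_fraction < 1:
        raise ValueError("max_synth_fraction must be in [0, 1)")

    seen = set(real)
    kept = []
    for example in synthetic:
        if example in seen:
            continue  # exact duplicate of a real or earlier synthetic example
        if quality_fn(example) < min_quality:
            continue  # fails the quality filter
        seen.add(example)
        kept.append(example)

    # Cap synthetic data so that synth / (real + synth) <= max_synth_fraction.
    cap = int(max_synth_fraction * len(real) / (1 - max_synth_fraction))
    rng = random.Random(seed)
    rng.shuffle(kept)
    kept = kept[:cap]

    mixed = list(real) + kept
    rng.shuffle(mixed)
    return mixed
```

Every real example is always kept, and the synthetic share is bounded relative to the amount of real data available. Verifying accuracy is not covered here, since it usually needs a check outside the generator, such as spot review by a person.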
"We couldn't get enough real examples of rare error codes, so we used GPT-4 to generate 5,000 synthetic support tickets for fine-tuning our classifier."
Techniques for artificially expanding training datasets by creating modified versions of existing data.
AI systems that create new content — text, images, audio, video, or code — rather than just analyzing or classifying existing data.
A mathematical function applied to a neuron's output that introduces non-linearity into the network.
An optimization algorithm that combines the best parts of two other methods — AdaGrad and RMSProp.
Artificial General Intelligence.
An autonomous AI system that can perceive its environment, make decisions, and take actions to achieve goals.