Masked Language Modeling (MLM): A pre-training technique in which random words in the text are hidden (masked) and the model learns to predict them from the surrounding context. It is BERT's core training method. Unlike autoregressive models, which read text only left-to-right, MLM uses context from both directions, giving better contextual understanding for tasks such as classification.
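As a rough illustration, the masking step might look like the sketch below; mask_tokens and the fixed 15% rate are illustrative only, and BERT's full recipe also sometimes substitutes a random token or leaves the selected token unchanged instead of always writing [MASK].

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15):
    """Randomly hide tokens; the hidden originals become the prediction targets."""
    masked, labels = list(tokens), [None] * len(tokens)  # None = position is not scored
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok         # the model must recover this token
            masked[i] = mask_token  # the input shows only the placeholder
    return masked, labels

masked, labels = mask_tokens("the cat sat on the mat".split())
# masked might be ['the', 'cat', '[MASK]', 'on', 'the', 'mat']
# labels would then be [None, None, 'sat', None, None, None]
```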
BERT: Bidirectional Encoder Representations from Transformers.
Pre-training: The initial, expensive phase of training in which a model learns general patterns from a massive dataset.
Self-Supervised Learning: A training approach in which the model derives its own training labels from the data itself, with no human annotation.
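One concrete instance of this idea is next-token prediction, where the labels are simply the input shifted by one position, so no manual labeling is required; a minimal sketch:

```python
# Labels come from the data itself: each token's "label" is the token that follows it.
tokens = "to be or not to be".split()
inputs  = tokens[:-1]   # ['to', 'be', 'or', 'not', 'to']
targets = tokens[1:]    # ['be', 'or', 'not', 'to', 'be']
pairs = list(zip(inputs, targets))  # (input, label) pairs with no human annotation
```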
Activation Function: A mathematical function applied to a neuron's output that introduces non-linearity into the network.
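Two common examples, sketched in plain Python (in practice the function is applied elementwise to every neuron's output):

```python
import math

def relu(x):
    """ReLU: zero for negative inputs, identity for positive ones."""
    return max(0.0, x)

def sigmoid(x):
    """Sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print([relu(v) for v in (-2.0, -0.5, 0.0, 3.0)])  # [0.0, 0.0, 0.0, 3.0]
print(round(sigmoid(0.0), 3))                     # 0.5
```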
Adam: An optimization algorithm that combines the strengths of two earlier methods, AdaGrad and RMSProp, by maintaining per-parameter adaptive learning rates from running estimates of the gradients' first and second moments.
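A single-parameter sketch of the Adam update rule might look like the following; adam_step is an illustrative helper rather than a library API, and the hyperparameter defaults follow the original paper.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter."""
    m = beta1 * m + (1 - beta1) * grad          # momentum-like first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # RMSProp-like second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(x) = x^2, whose gradient is 2x.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.1)
# x ends up near 0, the minimum of f.
```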
AGI: Artificial General Intelligence.