Rethinking Compute in Behavioral Models: The Case for a Slim Embedder
As foundation models evolve, compute calibration is lagging behind. A recent study suggests slimming down the embedder could be the key to efficient training.
In AI, the spotlight often shines on language models. Yet as foundation models begin to take in user actions in recommendation, payments, and commerce, they face challenges of their own. While language models benefit from established scaling laws, behavioral models are still searching for their own path to compute efficiency.
The Slim Embedder: A Smart Move
A recent study examined a common two-part architecture in behavioral models: a feature-based event embedder and a decoder-only transformer. Across 600 runs on real-world interaction data, the research explored scaling from 10^15 to 10^19 training FLOPs. A small embedder, accounting for just 2% of the parameters, emerged as the compute-optimal choice at every budget level tested.
Why is this significant? Embedder parameters are pricier per step and encounter more repetition than those of the contextualizer, the transformer that sits on top of the embedder. A leaner embedder is therefore not just a matter of cutting costs. It's a strategic decision that improves compute efficiency.
Data-Heavy Yet Efficient
Compute-optimal training for these models leans heavily on data, especially at lower compute levels. However, as compute power ramps up, the data-to-parameter ratio begins aligning with the Chinchilla heuristic, a notable trend for those familiar with scaling principles in language models.
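For readers who want the Chinchilla heuristic in concrete terms, here is a minimal sketch. It assumes the language-model rule of thumb of roughly 20 training tokens per parameter and the usual approximation that training cost is about 6 FLOPs per parameter per token. Both constants come from the language-model literature, not from the study discussed here, and the function name chinchilla_allocation is made up for illustration. For a behavioral model, "tokens" would stand for interaction events.

```python
import math

# Assumed constants from the language-model literature, not from the study.
TOKENS_PER_PARAM = 20.0       # Chinchilla rule of thumb: D ~= 20 * N
FLOPS_PER_PARAM_TOKEN = 6.0   # common approximation: C ~= 6 * N * D


def chinchilla_allocation(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget C into (params N, tokens D) with D = r * N and C = 6 * N * D."""
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    # Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


# Sweep the compute range mentioned in the article: 10^15 to 10^19 FLOPs.
for exponent in range(15, 20):
    n, d = chinchilla_allocation(10.0 ** exponent)
    print(f"1e{exponent} FLOPs: ~{n:.2e} params, ~{d:.2e} events")
```

A data-heavy regime, like the one reported at lower compute, would correspond to passing a tokens_per_param value above 20. The study's fitted ratios are not given in this article, so none are filled in here.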
This raises an intriguing question: as we push the boundaries of compute, can the same principles guide the evolution of both language and behavioral models?
Metrics and Scaling: A Dynamic Relationship
The study also highlighted an important dynamic: the relationship between training objectives and deployed ranking metrics shifts with compute. Notably, factors such as critical batch size and optimal negative count after freezing the embedder change as compute scales.
At larger budgets, the optimal recipe favors more negatives. By 10^19 FLOPs, memory rather than FLOPs is the binding constraint. This suggests that in behavioral models, the choice of evaluation metric can fundamentally alter the compute-optimal recipe.
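To see why memory can bind before FLOPs as negatives grow, consider a rough back-of-the-envelope sketch. It assumes a sampled-softmax style objective that scores one positive plus a set of negatives at every sequence position, and it counts only the logits tensor. The function names, shapes, and byte width are illustrative assumptions, not details from the study.

```python
def logits_bytes(batch_size, seq_len, num_negatives, bytes_per_value=4):
    """Memory for a (batch, seq_len, 1 + negatives) logits tensor, in bytes."""
    return batch_size * seq_len * (1 + num_negatives) * bytes_per_value


def max_negatives(memory_budget_bytes, batch_size, seq_len, bytes_per_value=4):
    """Largest negative count whose logits tensor fits in the given budget."""
    per_candidate = batch_size * seq_len * bytes_per_value
    # One slot per position is reserved for the positive item.
    return max(memory_budget_bytes // per_candidate - 1, 0)


# Hypothetical setup: 256 sequences of 512 events, float32 logits.
mib = logits_bytes(256, 512, 1023) / 2**20
print(f"1023 negatives: {mib:.0f} MiB of logits")
print(f"1 GiB budget fits {max_negatives(2**30, 256, 512)} negatives")
```

The logits footprint grows linearly with the negative count, so under these assumptions each doubling of negatives doubles that memory, before counting activations or gradients.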
Enterprise AI is boring. That's why it works. It's the efficiency gains, like a 40% reduction in document processing time, that turn heads, not flashy new algorithms. As models push into new territories like behavior tracking, nuances like these are where the real ROI lies.
Key Terms Explained
Batch size: The number of training examples processed together before the model updates its weights.
Chinchilla: A research paper from DeepMind showing that many large language models were oversized and undertrained for their compute budgets.
Compute: The processing power needed to train and run AI models.
Decoder: The part of a neural network that generates output from an internal representation.