Rethinking Compute in Behavioral Models: The Case for a Slim Embedder

In AI, the spotlight often shines on language models. Yet as foundation models extend to user actions in recommendation, payments, and commerce, they face challenges of their own. Language models benefit from established scaling laws; behavioral models are still searching for their own path to compute efficiency.

The Slim Embedder: A Smart Move

A recent study examined a common two-part architecture in behavioral models: a feature-based event embedder and a decoder-only transformer. Across 600 runs on real-world interaction data, the research explored scaling from 10¹⁵ to 10¹⁹ training FLOPs. The findings are revealing: a small embedder, holding just 2% of the parameters, emerged as the compute-optimal choice at every budget level tested.

Why is this significant? Embedder parameters are pricier per step and encounter more repetition than their counterparts in the contextualizer, the transformer half of the model. A leaner embedder is therefore not just a matter of cutting costs; it is a strategic decision that improves compute efficiency.
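To make the 2% figure concrete, here is a minimal sketch of how a parameter budget would divide between the two parts. It assumes the 2% is measured against the total parameter count; the function name and the 100M-parameter example are hypothetical and not taken from the study.

```python
def split_parameters(total_params, embedder_fraction=0.02):
    """Split a parameter budget between embedder and transformer.

    Assumption: embedder_fraction is a share of the TOTAL parameter count.
    The 0.02 default mirrors the 2% figure quoted above.
    """
    if total_params < 0:
        raise ValueError("total_params must be non-negative")
    if not 0.0 <= embedder_fraction <= 1.0:
        raise ValueError("embedder_fraction must be between 0 and 1")
    embedder = round(total_params * embedder_fraction)
    transformer = total_params - embedder
    return embedder, transformer


if __name__ == "__main__":
    # Hypothetical 100M-parameter model.
    emb, ctx = split_parameters(100_000_000)
    print(f"embedder: {emb:,} params, transformer: {ctx:,} params")
```

Under this assumption, nearly all of the capacity sits in the transformer, which is the part whose parameters are cheaper per step according to the study.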

Data-Heavy Yet Efficient

Compute-optimal training for these models leans heavily on data, especially at lower compute levels. However, as compute power ramps up, the data-to-parameter ratio begins aligning with the Chinchilla heuristic, a notable trend for those familiar with scaling principles in language models.
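For readers less familiar with the Chinchilla heuristic, the sketch below shows the arithmetic behind it. It relies on two rules of thumb from the language-model literature, not from the study discussed here: training compute is roughly 6 FLOPs per parameter per token, and the compute-optimal ratio is roughly 20 tokens per parameter. Treating one behavioral event as one token is a further simplifying assumption, and the function name is hypothetical.

```python
import math

# Assumptions (language-model rules of thumb, not results from the study):
#   training FLOPs C ~= 6 * N * D, and D / N ~= 20 at the compute optimum.
FLOPS_PER_PARAM_TOKEN = 6.0
CHINCHILLA_TOKENS_PER_PARAM = 20.0


def allocate_compute(compute_flops, tokens_per_param=CHINCHILLA_TOKENS_PER_PARAM):
    """Return (params, tokens) that spend compute_flops at a fixed D/N ratio."""
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    # C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r))
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    # Sweep the budget range quoted in the article.
    for exponent in range(15, 20):
        n, d = allocate_compute(10.0 ** exponent)
        print(f"C=1e{exponent}: N ~ {n:.3g} params, D ~ {d:.3g} tokens")
```

Passing a larger `tokens_per_param` models a data-heavier regime: the same budget then buys a smaller model trained on more data, which is the direction the study reports at low compute.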

This raises an intriguing question: as compute budgets grow, can the same principles guide the evolution of both language and behavioral models? Either way, efficiency clearly matters.

Metrics and Scaling: A Dynamic Relationship

The study also highlighted an important dynamic: the relationship between training objectives and deployed ranking metrics shifts with compute. Notably, factors such as critical batch size and optimal negative count after freezing the embedder change as compute scales.

At larger budgets, the optimal recipe leans toward more negative samples. By 10¹⁹ FLOPs, memory rather than FLOPs is the binding constraint. This suggests that in behavioral models, the choice of evaluation metric can fundamentally alter the compute-optimal recipe.
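One way to see why memory can become the limit as negatives grow is to count the scores a sampled loss has to hold. The sketch below is a deliberately simplified model, not the study's accounting: it assumes every sequence position scores one positive plus its own set of negatives, stores each score in 4 bytes, and ignores gradients, activations, and candidate embeddings. Both function names are hypothetical.

```python
def score_memory_bytes(batch_size, seq_len, num_negatives, bytes_per_value=4):
    """Bytes needed to hold one score per (position, candidate) pair.

    Simplified model: each position scores 1 positive + num_negatives
    negatives, and nothing else is counted.
    """
    candidates = 1 + num_negatives
    return batch_size * seq_len * candidates * bytes_per_value


def max_negatives(memory_budget_bytes, batch_size, seq_len, bytes_per_value=4):
    """Largest negative count whose scores fit in memory_budget_bytes."""
    per_candidate = batch_size * seq_len * bytes_per_value
    return max(memory_budget_bytes // per_candidate - 1, 0)


if __name__ == "__main__":
    # Hypothetical shapes: 256 sequences of 512 events, 1023 negatives each.
    used = score_memory_bytes(256, 512, 1023)
    print(f"scores alone: {used / 2**20:.0f} MiB")
    print(f"negatives that fit in 512 MiB: {max_negatives(2**29, 256, 512)}")
```

In this model the score memory grows linearly with the negative count, so doubling negatives doubles the footprint regardless of how much compute is available.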

Enterprise AI is boring. That's why it works. It's the efficiency gains, like a 40% reduction in document processing time, that turn heads, not flashy new algorithms. As models push into new territories like behavior tracking, nuances like these are where the real ROI lies.
