Why Algorithm Wars in AI Training Might Be Overblown
A new study compares 51 post-training algorithms, revealing model scale as the real powerhouse over algorithm choice. Are we focusing on the wrong thing?
In the crowded landscape of AI development, a new heavyweight contender has thrown its hat into the ring: OXRL. This framework brings together 51 post-training algorithms under the same roof, allowing an unprecedented apples-to-apples evaluation. With over 240 training runs on some of the most powerful hardware around, OXRL is setting the stage for a deeper understanding of AI training.
The Algorithm Showdown
The AI training industry is a bit of a battlefield, with algorithms like DPO, SimPO, and KTO vying for supremacy. But OXRL's findings suggest we might be missing the forest for the trees. Across 8 algorithms, 4 model scales (from 0.5B to 7B), and various evaluation domains, one thing became clear: the scale of the model trumps the choice of algorithm.
At the 1.5B parameter scale, online RL methods like SGRPO led the pack with a 58% success rate on the GSM8K benchmark. Yet by 7B, the narrative flipped completely: SimPO, previously the underdog, soared to the top at 85.8%. Does this mean the algorithm wars are over? It seems scale is the real MVP here.
Loss Functions: The Barely-There Impact
OXRL's deep dive also dismantled the big buzz around loss function tweaks. Among the 20 DPO variants tested, none significantly outperformed the standard version. In fact, SimPO lagged behind by a staggering 11.5 percentage points. So, should we really be sweating the small stuff, like loss functions? It seems not.
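For readers who haven't seen these losses written out, here's a minimal sketch of the standard DPO objective and the SimPO variant for a single preference pair. This is illustrative only, not OXRL's implementation; the function names, default hyperparameters, and input values are assumptions made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """Standard DPO loss for one preference pair.

    chosen_logratio   = log pi(y_w|x) - log pi_ref(y_w|x)
    rejected_logratio = log pi(y_l|x) - log pi_ref(y_l|x)
    The loss rewards the policy for preferring y_w over y_l
    relative to a frozen reference model.
    """
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(sigmoid(margin))

def simpo_loss(chosen_avg_logp, rejected_avg_logp, beta=2.0, gamma=0.5):
    """SimPO variant: uses length-normalized (average per-token)
    log-probs, drops the reference model, and adds a target
    reward margin gamma. Hyperparameter values here are illustrative.
    """
    margin = beta * (chosen_avg_logp - rejected_avg_logp) - gamma
    return -math.log(sigmoid(margin))

# The loss shrinks as the model separates chosen from rejected responses.
confident = dpo_loss(chosen_logratio=1.0, rejected_logratio=-1.0)
confused = dpo_loss(chosen_logratio=-1.0, rejected_logratio=1.0)
```

The point of the sketch is how small the structural differences are: both losses push up the same sigmoid of a margin, which is consistent with the study's finding that swapping one for another moves the needle far less than scaling the model does.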
Instead, it's clear that where an algorithm stands out is largely task-specific. While differences in algorithm performance were stark on the GSM8K benchmark, they all but vanished on MATH and on general-domain tests. The takeaway: algorithm selection may only matter within specific training distributions.
Reassessing Priorities
The study concludes with a hierarchy of what truly matters in AI training: model scale delivers a 50-point punch, followed by training paradigms (10 points) and the online-versus-offline choice (9 points). Loss functions barely make a dent at just 1 point. The builders never left, but maybe they need to shift focus.
Given these findings, it's worth asking: are we putting too much focus on an algorithm's bells and whistles when scale is the real breakthrough? The meta has shifted, and obsessing over algorithms might just be a distraction from what's truly driving AI forward.
All of OXRL's code, configurations, and evaluation data are available as a living benchmark for the community to explore, ensuring that this conversation continues to evolve with real, tangible data backing it up.
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
DPO: Direct Preference Optimization, one of the post-training algorithms compared in the study.
Evaluation: The process of measuring how well an AI model performs on its intended task.
Loss function: A mathematical function that measures how far the model's predictions are from the correct answers.
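As a concrete illustration of that last term, here is a minimal cross-entropy loss, a common loss function for classification and language models. The probabilities below are made-up example values, not data from the study.

```python
import math

def cross_entropy(pred_probs, true_index):
    """Cross-entropy loss for one example: the negative log of the
    probability the model assigned to the correct class."""
    return -math.log(pred_probs[true_index])

# A confident, correct prediction yields a low loss...
low = cross_entropy([0.05, 0.9, 0.05], true_index=1)   # -log(0.9)
# ...while a confident, wrong one yields a high loss.
high = cross_entropy([0.9, 0.05, 0.05], true_index=1)  # -log(0.05)
```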