Revolutionizing Agentic Models: From Substrates to Superior Training

Training small-model agentic systems has traditionally hit a wall because of the trajectory data they consume. The usual methods (RLVR, group-relative RL, and rejection-sampled re-SFT) all rely on multi-turn traces with comprehensive supervision. Current data sources fall short of that: synthesized traces inherit the biases of their synthesizers, while production logs are contaminated and carry no judgment of trajectory quality.

The Arena Approach

ORO Subnet 15 (SN15) offers a promising solution: an engineered agent arena tailored to the ShoppingBench agentic-commerce benchmark. This setup is more than a typical partnership announcement. It combines three mechanisms: a race mechanism, a reasoning judge built on large language models (LLMs), and a rotating problem suite designed to guard against information leaks. Together, these produce a data corpus that is diverse, well-judged, and resistant to memorization.

Why does this matter? Because SN15 doesn't just generate data. It generates incentive-aligned data: the arena's incentives push the collected trajectories toward the outcomes the trained model is meant to achieve. It also includes a structural-quality filter that separates agentic trajectories, where the model actively makes tool calls, from sub-task trajectories, which are mere classifications or deterministic loops.
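To make the agentic versus sub-task distinction concrete, here is a minimal sketch of what such a structural filter could look like. It is not the released SN15 filter: the `Step` and `Trajectory` types, the `classify` function, and the `min_tool_calls` threshold are all hypothetical, chosen only to illustrate the two failure shapes the article names (no real tool use, and a single call repeated in a loop).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One turn of a trajectory (hypothetical schema, not the SN15 format)."""
    role: str
    tool_name: Optional[str] = None  # None when the turn makes no tool call
    tool_args: str = ""


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)


def classify(traj: Trajectory, min_tool_calls: int = 2) -> str:
    """Label a trajectory 'agentic' or 'sub_task' using assumed heuristics."""
    calls = [(s.tool_name, s.tool_args) for s in traj.steps if s.tool_name is not None]
    if len(calls) < min_tool_calls:
        # Little or no tool use: treat it as a classification-style sub-task.
        return "sub_task"
    if len(set(calls)) == 1:
        # The identical call repeated verbatim: treat it as a deterministic loop.
        return "sub_task"
    return "agentic"


if __name__ == "__main__":
    shopping = Trajectory([
        Step("assistant", "search", "q=running shoes"),
        Step("assistant", "view_product", "id=123"),
        Step("assistant"),
    ])
    print(classify(shopping))
```

A real filter would likely look at richer signals (argument diversity, whether tool results influence later steps), but the split into two labelled pools is the part the article describes.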

Performance and Potential

The results are encouraging. Training Qwen3-4B on this refined corpus, aligned with the ShoppingBench SFT-then-GRPO pipeline, raised the average success rate (ASR) from 18.0% for the base model to 42.7%. That is noteworthy given that the training data came from a fraction of a single day of subnet output. The result lands close to the synthetic-data SFT-only baseline of 43.6%.
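The headline numbers are easier to compare as differences. The short sketch below only restates the figures quoted above; the variable names are illustrative.

```python
# ASR figures as quoted in the article, in percent.
base_asr = 18.0           # Qwen3-4B base model
arena_asr = 42.7          # after training on the SN15 corpus
synthetic_sft_asr = 43.6  # synthetic-data SFT-only baseline

# Absolute gain in percentage points and gain as a multiple of the base.
gain_points = round(arena_asr - base_asr, 1)
relative_gain = round(arena_asr / base_asr, 2)

# Distance still separating the arena-trained model from the baseline.
remaining_gap = round(synthetic_sft_asr - arena_asr, 1)

print(f"gain: {gain_points} points ({relative_gain}x base)")
print(f"gap to synthetic SFT-only baseline: {remaining_gap} points")
```

In other words, the arena-trained model gains 24.7 points over the base model and sits 0.9 points below the synthetic-data SFT-only baseline.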

Yet there is more to explore. The supervised training stack leaves a notable gap between pass@8 and pass@1 (53.3% versus 34.8%). Dr. GRPO's teacher-grounded rewards offer a potential path to narrowing it. But how far can this go? The sub-task firehose emerges as the key lever for closing the remaining gap to the SFT+GRPO benchmark of 48.7%.
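For readers unfamiliar with the metric, pass@k is the probability that at least one of k sampled attempts solves a task. The article does not say how its pass@k values were computed; the sketch below uses the widely cited unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples with c successes, and should be read as an assumption about the method rather than a description of it. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c succeeded."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Figures quoted in the article, in percent.
pass_at_8 = 53.3
pass_at_1 = 34.8
sampling_gap = round(pass_at_8 - pass_at_1, 1)

if __name__ == "__main__":
    # A task solved in 2 of 8 samples: rarely right on the first try,
    # but always covered when all 8 samples are considered.
    print(pass_at_k(n=8, c=2, k=1), pass_at_k(n=8, c=2, k=8))
    print(f"pass@8 minus pass@1: {sampling_gap} points")
```

The 18.5-point spread between pass@8 and pass@1 means the model often can solve a task but does not do so reliably on a single attempt, which is the kind of gap reward-based training aims to close.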

Releasing the Tools

The developers have made the filter, corpus splits, and arena mechanics publicly available. This move does more than promise collaborative advancement: it invites the community to help shape the future of agentic AI training. We're building the financial plumbing for machines, and this release lays down the pipes.

If AI agents are to hold their own in dynamic environments, the tools that train them must evolve. The shift towards structured, incentive-aligned data isn't just a technical upgrade. It's a strategic pivot in machine learning's trajectory.
