A New Dawn for Neural Combinatorial Optimization: GRPO's Promising Leap

In neural combinatorial optimization (NCO), the choice of training algorithm can make all the difference. Traditional algorithms like REINFORCE rely heavily on a baseline for variance reduction, and they often stumble on tougher instances. There, the baseline that normally stabilizes training can become a structural weakness, introducing noise and destabilizing the training process.

A New Approach: GRPO

Enter Group Relative Policy Optimization (GRPO), an algorithm borrowed from large language model alignment. It drops the separate baseline altogether and instead normalizes advantages within groups of sampled trajectories. The real question is whether GRPO can bring stability and efficiency to NCO where baseline-dependent methods falter.
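To make the group normalization concrete, here is a minimal sketch of how advantages might be computed for one group of tours sampled on the same instance. It assumes the reward is the negative tour cost and that each advantage is the reward minus the group mean, divided by the group standard deviation. The function name, the `eps` constant, and the example costs are illustrative assumptions, not taken from the evaluated implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(costs, eps=1e-8):
    """Normalize rewards within one group of sampled trajectories.

    Assumption: reward = -cost, so cheaper tours get higher advantages.
    No learned or external baseline is involved; the group's own mean
    and standard deviation do the centering and scaling.
    """
    rewards = [-c for c in costs]
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps guards against division by zero when all tours cost the same
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical costs of four tours sampled for the same instance
group_costs = [10.2, 9.8, 11.0, 10.0]
advantages = group_relative_advantages(group_costs)
```

In this sketch the advantages sum to roughly zero within each group, and the cheapest tour receives the largest advantage. A group in which every tour has the same cost yields all-zero advantages and therefore no gradient signal.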

In a comparative evaluation of five reinforcement learning algorithms, GRPO was put to the test on two well-known benchmarks, the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP), within the RL4CO framework. The results are striking. In one scenario, GRPO avoided the training collapse that REINFORCE exhibited on TSP-100, where the cost jumped from 9.8 to 52.1 after warmup and did not recover even with extended training.

GRPO vs. The Establishment

But how does GRPO measure up against an established method like POMO, a strong multi-start baseline from the attention model family? At an equal number of gradient updates, GRPO delivered solutions within 2% of POMO's quality, without the need for an external baseline. This positions GRPO not just as an alternative, but as a serious contender in baseline-free neural combinatorial optimization.
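For readers who want to check such comparisons themselves, the sketch below shows one common way to express a quality difference: the relative gap between a method's cost and a reference cost. The helper name and the 102.0 versus 100.0 pair are illustrative assumptions. Only the 9.8 and 52.1 figures come from the REINFORCE collapse reported above.

```python
def relative_gap(cost, reference_cost):
    """Fractional increase of `cost` over `reference_cost` (lower cost is better)."""
    if reference_cost <= 0:
        raise ValueError("reference_cost must be positive")
    return (cost - reference_cost) / reference_cost


# REINFORCE on TSP-100: cost before and after the reported collapse
collapse = relative_gap(52.1, 9.8)  # more than a fourfold increase

# Hypothetical costs showing what "within 2%" of a reference means
within_two_percent = relative_gap(102.0, 100.0) <= 0.02 + 1e-12
```

A gap of 0.02 means the method's solutions cost 2% more than the reference on average. The collapse figure, by contrast, corresponds to a gap above 4.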

Meanwhile, P3O, another algorithm drawn from the alignment literature, was competitive on TSP. However, it showed higher variability on CVRP, a reminder that results do not transfer uniformly across problem settings.

Implications and Future Outlook

GRPO's emergence signals a shift towards robustness and adaptability in how NCO models are trained. As baseline dependency proves fragile, especially on challenging instances, GRPO's baseline-free approach could mark the start of a broader change. The question now is whether the research community will embrace this shift or stay with the familiar security of baseline-based methods.

Ultimately, GRPO's progress in neural combinatorial optimization is hard to overlook. For those invested in the efficiency and stability of training algorithms, it is a development worth watching closely.
