Why Reinforcement Learning Is More Random Than You Think

Ever wondered why your reinforcement learning (RL) models don't always deliver consistent results? You're not alone. Deep reinforcement learning algorithms, hailed as the future of AI, are notorious for their unpredictability. Even with an identical configuration, the same algorithm can show drastic performance variations from one run to the next. That's right. Your model's not broken. It's just inherently unstable.

The Problem with Current Metrics

Let's talk numbers. Most RL research reports uncertainty metrics on estimated mean performance. Sounds solid, right? Wrong. Those metrics describe how precisely the average is pinned down, not how far any single run can land from it, so they often understate the actual run-to-run variation and are largely misaligned with real-world needs. It's like knowing the average depth of a river when what you need is the deepest point.
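To make the gap concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the returns are made up rather than taken from any experiment, and the helper names `mean_ci_halfwidth` and `run_spread` are hypothetical. It compares a normal-approximation 95% confidence interval on the mean with the plain gap between the best and worst run.

```python
import math
import statistics

def mean_ci_halfwidth(returns, z=1.96):
    """Half-width of a normal-approximation 95% confidence interval on the mean."""
    return z * statistics.stdev(returns) / math.sqrt(len(returns))

def run_spread(returns):
    """Gap between the best and worst run: what a single run can actually do."""
    return max(returns) - min(returns)

# Made-up final returns (not real results): 8 of every 10 runs succeed, 2 collapse.
pattern = [100.0] * 8 + [20.0] * 2

for copies in (1, 10, 100):
    returns = pattern * copies
    # Columns: number of runs, CI half-width on the mean, best-to-worst spread.
    print(len(returns), round(mean_ci_halfwidth(returns), 2), run_spread(returns))
```

As more runs are added, the interval on the mean keeps shrinking, while the spread between runs stays exactly where it was. A tight error bar on the mean says nothing about the two runs in ten that collapse.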

A New Approach

So, what's the alternative? Enter percentile-based statistics. Instead of vague averages, this approach employs min-max interpolated percentile ranges (IPR) alongside run-wise percentile highlighting. These tools are much clearer and rely on standard properties of sample percentiles, offering rich insights into run-to-run performance shifts. It's like moving from black-and-white to full color.
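Here is a rough Python sketch of what percentile-based reporting could look like. It rests on assumptions, since the exact definitions are not given above: it treats a percentile range as the interval between a lower and an upper linearly interpolated sample percentile (10th and 90th here, chosen arbitrarily), and it treats run-wise highlighting as picking the actual run whose score sits closest to a chosen percentile. The function names and the scores are hypothetical.

```python
import math

def interpolated_percentile(values, q):
    """q-th sample percentile (0-100) with linear interpolation between order statistics."""
    xs = sorted(values)
    if not xs:
        raise ValueError("need at least one value")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

def percentile_range(values, lower=10, upper=90):
    """Assumed reading of a percentile range: (lower, upper) interpolated percentiles."""
    return interpolated_percentile(values, lower), interpolated_percentile(values, upper)

def closest_run(values, q):
    """Index of the run whose score is nearest the q-th percentile (for highlighting)."""
    target = interpolated_percentile(values, q)
    return min(range(len(values)), key=lambda i: abs(values[i] - target))

# Made-up final returns for ten runs of one configuration (not real results).
final_returns = [12.0, 95.0, 101.0, 98.0, 30.0, 104.0, 99.0, 97.0, 102.0, 100.0]

print("10th-90th percentile range:", percentile_range(final_returns))
print("run to highlight as the median:", closest_run(final_returns, 50))
print("run to highlight as the 10th percentile:", closest_run(final_returns, 10))
```

Plotting the learning curves of the highlighted runs, rather than a single averaged curve, shows what a typical run and an unlucky run each look like.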

Case Studies: Winners and Losers

Here's where the rubber meets the road. Through various case studies, this new method shows its value. LayerNorm and penultimate-layer normalizations reduce run-to-run performance variation in Proximal Policy Optimization (PPO). Meanwhile, Soft Actor-Critic (SAC) doesn't see much change. Talk about uneven results.

Then there's a showdown: PPO, SAC, TD-MPC, and TD-MPC2 go head-to-head. TD-MPC stands out with the least variation and the most data efficiency. But is that enough? Finally, a comparison between Deep Q-Network (DQN) and Rainbow in five Atari environments reveals similar variation levels. You'd expect more distinction from an algorithm that stacks several improvements on top of DQN, wouldn't you?

Why This Matters

So, why should you care? Because if you're betting on AI, you need reliability. You wouldn't drive a car that sometimes turns left when you steer right, would you? These insights challenge the rosy narratives around RL.

Until these models mature, expect surprises. And not the good kind.