RePoT: Fixing AI Plans with Smarter Recovery

When an AI model generates a plan, a single invalid action can derail every step that follows. RePoT, short for Recoverable Program-of-Thought, addresses this by verifying each action and correcting the plan when one fails. Its predecessor, Program-of-Thought (PoT), falters as soon as it emits an invalid action; RePoT is designed to recover from that point and carry the plan through.

Smarter Recovery Mechanism

RePoT relies on a deterministic verified replay mechanism. It walks through a plan step by step until an action fails verification, then uses a single large language model (LLM) call to get back on track from that point. This makes the difference on the roughly 14% of problems where PoT typically fails. Across closed-model configurations, RePoT raises success rates by 3 to 11 percentage points, peaking at 96.9% on gpt-5.4-mini-medium, compared to PoT's 86.3%.
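The replay-then-repair loop described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the corridor domain, the `step` verifier, and the `repair_suffix` stub (which stands in for the single LLM call) are all assumptions made up for this example.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical toy domain: an agent on a corridor with cells 0..3.
GOAL = 3


def step(pos: int, action: str) -> Optional[int]:
    """Deterministic verifier: return the next state, or None if the action is invalid."""
    delta = {"left": -1, "right": 1}.get(action)
    if delta is None:
        return None  # unknown action
    new_pos = pos + delta
    return new_pos if 0 <= new_pos <= GOAL else None  # walking off the corridor is invalid


def verified_replay(
    state: int, plan: List[str], step_fn: Callable[[int, str], Optional[int]]
) -> Tuple[List[str], int, Optional[int]]:
    """Replay the plan until the first invalid action.

    Returns (verified prefix, checkpoint state, index of the failed action or None).
    """
    prefix: List[str] = []
    for i, action in enumerate(plan):
        next_state = step_fn(state, action)
        if next_state is None:
            return prefix, state, i
        prefix.append(action)
        state = next_state
    return prefix, state, None


def repair_suffix(checkpoint: int, failed_action: str) -> List[str]:
    """Stand-in for the one LLM call; here it simply walks right to the goal."""
    return ["right"] * (GOAL - checkpoint)


def recover(initial: int, plan: List[str]) -> List[str]:
    """Keep the verified prefix and replace everything from the failure onward."""
    prefix, checkpoint, failed = verified_replay(initial, plan, step)
    if failed is None:
        return plan  # the whole plan verified; nothing to repair
    return prefix + repair_suffix(checkpoint, plan[failed])


result = recover(0, ["right", "right", "jump", "right"])
print(result)  # ['right', 'right', 'right']
```

The point of the sketch is that the verified prefix is never regenerated: only the part of the plan after the checkpoint is handed to the model.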

Outperforming the Competition

Against the PoT-retry baseline, RePoT is a clear winner on the Gemini model, improving performance by 3.8 percentage points. However, it only matches PoT-retry on the GPT-medium and Claude models, and it surprisingly loses ground on GPT-mini. This uneven pattern raises an important question: does the payoff from repair, as opposed to a plain retry, depend on how capable the underlying model is?

Adaptive RePoT steps in as a potential solution. It uses a rule-based dispatcher that decides, based on the length of the verified prefix, whether to repair the suffix or retry from scratch. Though preliminary, this approach shows promise in addressing the capability-scaling issue.
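A dispatcher of this kind could look roughly like the sketch below. The article only says the rule depends on the length of the verified prefix, so the threshold value, the direction of the rule (a long prefix favors repair), and the names `choose_strategy` and `min_prefix_len` are all assumptions for illustration.

```python
def choose_strategy(verified_prefix_len: int, min_prefix_len: int = 3) -> str:
    """Rule-based dispatch between suffix repair and a full retry.

    Assumption: if enough of the plan has already been verified, it is worth
    keeping and repairing only the suffix; otherwise start again from scratch.
    """
    if verified_prefix_len >= min_prefix_len:
        return "repair"
    return "retry"


# Hypothetical usage: five verified steps favor repair, one favors retry.
long_prefix_choice = choose_strategy(5)
short_prefix_choice = choose_strategy(1)
```

In practice the threshold would have to be tuned per model, which is presumably where the capability question above comes back in.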

Benchmarking and Future Prospects

RePoT's effectiveness extends beyond closed-model configurations. On PlanBench Blocksworld, it delivers improvements ranging from 1.1 to 11.4 percentage points. On open-weights models, it delivers gains of 3.3 to 20.0 percentage points across most configurations. In the controlled Derail-550 benchmark, giving the model access to checkpoint information yields a marked improvement over error-only feedback, which indicates that the checkpoint itself, not just the error message, drives the recovery.

This is where RePoT's breakthrough matters. If AI systems can recover from their own errors so efficiently, what does this mean for the future of autonomous agents? With better recovery mechanisms, AI could be trusted with more complex tasks, reducing the need for human oversight in routine operations.

Reliable error recovery is a real need for AI planning, but not every project will get it right. RePoT's approach offers a glimpse into a future where AI doesn't just execute plans, but adapts in real time to ensure success. It's a step forward, but let's see who can keep up with the benchmark.