RePoT: A New Paradigm in AI Planning

In AI planning, where precision and efficiency drive innovation, one-shot models have dominated the scene. Their limitations, however, are evident. Enter RePoT, an advancement over the Program-of-Thought (PoT) model that addresses these shortcomings with a focus on recovery and verification.

A Leap Beyond Traditional Models

The conventional PoT framework is straightforward: the model generates a Python program to execute a plan. Yet a single mistake can render the whole plan useless. RePoT, or Recoverable PoT, tackles this by introducing a deterministic verified replay mechanism. It walks through the plan step by step, identifies the first misstep, and then calls on a language model to resume from the last verified point. This method requires only one additional LLM call in approximately 14% of problematic cases.
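The replay-and-resume loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the RePoT implementation: the 3x3 grid domain, the names `step`, `verified_replay`, `recover`, and `stub_suffix` are all hypothetical, and `stub_suffix` stands in for the single extra language-model call.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, int]   # toy state: (x, y) position on a grid
Action = str              # one of "U", "D", "L", "R"

GRID_SIZE = 3             # hypothetical 3x3 grid, cells (0, 0) .. (2, 2)
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def step(state: State, action: Action) -> Optional[State]:
    """Deterministic transition; returns None when the action is invalid."""
    if action not in MOVES:
        return None
    x = state[0] + MOVES[action][0]
    y = state[1] + MOVES[action][1]
    if 0 <= x < GRID_SIZE and 0 <= y < GRID_SIZE:
        return (x, y)
    return None


def verified_replay(start: State, plan: List[Action], goal: State):
    """Replay the plan and stop at the first invalid action.

    Returns (verified_prefix, checkpoint_state, solved).
    """
    state = start
    for i, action in enumerate(plan):
        nxt = step(state, action)
        if nxt is None:
            # First misstep: everything before index i is verified.
            return list(plan[:i]), state, False
        state = nxt
    return list(plan), state, state == goal


def recover(start: State, plan: List[Action], goal: State,
            propose_suffix: Callable[[State, State], List[Action]]):
    """Keep the verified prefix and ask for a new suffix from the checkpoint."""
    prefix, checkpoint, solved = verified_replay(start, plan, goal)
    if solved:
        return prefix                          # no extra call needed
    suffix = propose_suffix(checkpoint, goal)  # the one additional call
    candidate = prefix + suffix
    _, _, solved = verified_replay(start, candidate, goal)
    return candidate if solved else None


def stub_suffix(checkpoint: State, goal: State) -> List[Action]:
    """Stand-in for the language model (assumption, for illustration only)."""
    dx, dy = goal[0] - checkpoint[0], goal[1] - checkpoint[1]
    moves = ["R"] * dx if dx > 0 else ["L"] * (-dx)
    moves += ["U"] * dy if dy > 0 else ["D"] * (-dy)
    return moves


broken_plan = ["R", "R", "R", "U"]   # the third "R" leaves the grid
prefix, checkpoint, solved = verified_replay((0, 0), broken_plan, (2, 1))
repaired = recover((0, 0), broken_plan, (2, 1), stub_suffix)
```

In this toy run the replay verifies the first two moves, stops at the third, and the repair only has to supply the remaining suffix from the checkpoint state rather than a whole new plan.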

Why does this matter? Because RePoT significantly boosts performance. Across four closed-model configurations on PuzzleZoo-775, RePoT outshines PoT by a margin of 3 to 11 percentage points. In particular, it achieves a peak performance of 96.9% on the gpt-5.4-mini-medium model, compared to PoT's 86.3%.

Convergence of Efficiency and Accuracy

Against the PoT-retry baseline, the picture is more mixed. RePoT wins decisively on Gemini, by 3.8 percentage points, performs similarly on GPT-medium and Claude, and falls behind on GPT-mini. This pattern suggests a need for adaptive strategies. Enter Adaptive RePoT, which employs a rule-based dispatcher to choose between suffix repair and a fresh retry based on the length of the verified prefix.
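A dispatcher of this kind could look roughly like the sketch below. The function name `dispatch`, the threshold `min_prefix_len`, its default value, and the direction of the rule (repair when the verified prefix is long, retry when it is short) are all assumptions made for illustration; the article does not state the actual rule.

```python
def dispatch(verified_prefix_len: int, min_prefix_len: int = 3) -> str:
    """Pick a recovery strategy from the length of the verified prefix.

    Assumed rule: a long verified prefix is worth keeping, so repair only
    the suffix; a short one carries little value, so retry from scratch.
    """
    if verified_prefix_len >= min_prefix_len:
        return "suffix_repair"
    return "fresh_retry"


choices = [dispatch(n) for n in (0, 2, 3, 7)]
```

Because the rule is a plain comparison, it adds no model calls of its own; only the chosen strategy does.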

The gains are not limited to PuzzleZoo-775. On PlanBench Blocksworld, RePoT exhibits improvements ranging from 1.1 to 11.4 percentage points. It also shows gains of 3.3 to 20.0 percentage points on three of four open-weight models.

Why Checkpoints Matter

In AI planning, it's not just about executing plans but about recovering from errors effectively. The controlled recovery benchmark, Derail-550, underscores this point. Here, conditions with access to checkpoint information consistently clear at least 30% on GPT-medium and 70% on Gemini. In stark contrast, conditions relying solely on error feedback clear at most 3.1%.
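To make the two kinds of condition concrete, here is a hypothetical sketch of what each might hand back to the model. The function names and message wording are assumptions for illustration and are not the prompts used in Derail-550; the point is only that the checkpoint variant includes the verified steps and the state reached after them, while the error-only variant does not.

```python
from typing import List, Tuple


def error_only_feedback(plan: List[str], failing_index: int, error: str) -> str:
    """Feedback that reports the failure but no verified state."""
    return (
        f"Your plan failed at step {failing_index + 1} "
        f"({plan[failing_index]!r}): {error}\n"
        "Write a new plan."
    )


def checkpoint_feedback(plan: List[str], failing_index: int, error: str,
                        checkpoint_state: Tuple[int, int]) -> str:
    """Feedback that also carries the verified prefix and checkpoint state."""
    verified = plan[:failing_index]
    return (
        f"Steps {verified} were verified and are kept.\n"
        f"State after the verified steps: {checkpoint_state}\n"
        f"The next step ({plan[failing_index]!r}) failed: {error}\n"
        "Continue the plan from this state."
    )


plan = ["R", "R", "R", "U"]
without_checkpoint = error_only_feedback(plan, 2, "moves off the grid")
with_checkpoint = checkpoint_feedback(plan, 2, "moves off the grid", (2, 0))
```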

The takeaway? Checkpoint information is essential for effective recovery. It acts as the backbone of RePoT's superior planning capability. As AI systems grow more autonomous, much may depend on how efficiently they detect and recover from their missteps.