DRIVE: Redefining Web Agents with Dual-Level Skills

Web agents are central to automating complex online tasks. They must balance high-level reasoning for task decomposition with low-level interaction for executing page-specific actions. Yet existing systems struggle to reconcile these fundamentally different kinds of knowledge. The result is often an agent that either misinterprets reusable logic because of superficial differences between webpages, or attempts outdated actions that no longer work.

The DRIVE Solution

Enter DRIVE, a novel framework designed to tackle this challenge head-on. It separates historical experience into two distinct skill sets: natural language reasoning skills, which capture transferable logic, and programmatic interaction skills, which capture executable operations. Keeping the two apart resolves the entanglement between them, and this dual-level skill modeling allows for more adaptable and effective web agents.
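To make the split concrete, here is a minimal sketch of how a two-level skill library might be represented. This is an illustration under assumptions, not DRIVE's actual implementation: the class names, fields, the `search_product` helper, and the action strings are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ReasoningSkill:
    """Transferable task logic, stored as natural language (hypothetical schema)."""
    name: str
    guidance: str


@dataclass
class InteractionSkill:
    """Page-specific operation, stored as executable code (hypothetical schema)."""
    name: str
    site: str
    run: Callable[[Dict[str, str]], List[str]]  # returns low-level action strings


@dataclass
class SkillLibrary:
    """Keeps the two skill levels in separate collections."""
    reasoning: List[ReasoningSkill] = field(default_factory=list)
    interaction: List[InteractionSkill] = field(default_factory=list)


def search_product(args: Dict[str, str]) -> List[str]:
    # Hypothetical action syntax; a real agent would emit browser commands.
    return [f"type(#search, {args['query']})", "click(#search-button)"]


library = SkillLibrary()
library.reasoning.append(
    ReasoningSkill(
        name="find_cheapest",
        guidance="Search for the item, sort results by price, then open the first result.",
    )
)
library.interaction.append(
    InteractionSkill(name="search_product", site="shopping", run=search_product)
)

# The reasoning skill is read as text; the interaction skill is executed.
actions = library.interaction[0].run({"query": "usb cable"})
```

The point of the separation in this sketch is that the guidance text stays valid if a site changes its layout, while only the small executable function needs to be rewritten.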

DRIVE uses a scene-aware coordination mechanism that retrieves and applies these skills based on task semantics. Crucially, it employs skill-level reflection to pinpoint the failure modes specific to each level, paving the way for targeted improvement and expansion of its skill library.
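The retrieval and reflection steps can be sketched roughly as follows. This is only an illustration under stated assumptions: the article does not describe how DRIVE matches skills to tasks or attributes failures, so word overlap stands in here for semantic matching, and the `Skill` fields, scene labels, and the `retrieve` and `reflect` functions are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Skill:
    name: str
    level: str        # "reasoning" or "interaction"
    scene: str        # hypothetical scene label, e.g. "shopping"
    description: str


def tokenize(text: str) -> Set[str]:
    return set(text.lower().split())


def retrieve(skills: List[Skill], task: str, scene: str, level: str) -> Optional[Skill]:
    """Return the skill at the given level and scene that best matches the task.

    Word overlap is a crude stand-in for semantic similarity (an assumption).
    """
    candidates = [s for s in skills if s.level == level and s.scene == scene]
    if not candidates:
        return None
    task_words = tokenize(task)
    best = max(candidates, key=lambda s: len(task_words & tokenize(s.description)))
    # Reject the best candidate if it shares no words with the task at all.
    return best if task_words & tokenize(best.description) else None


def reflect(plan_was_valid: bool, action_succeeded: bool) -> Optional[str]:
    """Attribute a failure to one skill level so only that level is revised."""
    if not plan_was_valid:
        return "reasoning"
    if not action_succeeded:
        return "interaction"
    return None


skills = [
    Skill("compare_prices", "reasoning", "shopping",
          "find the cheapest item by sorting results by price"),
    Skill("post_comment", "reasoning", "forum",
          "write a comment under a post"),
    Skill("sort_by_price", "interaction", "shopping",
          "click the sort dropdown and choose price"),
]

task = "find the cheapest usb cable"
plan = retrieve(skills, task, scene="shopping", level="reasoning")
step = retrieve(skills, task, scene="shopping", level="interaction")
failed_level = reflect(plan_was_valid=True, action_succeeded=False)
```

In this sketch, a plan that was sound but whose click failed is attributed to the interaction level, so the reasoning skill would be left untouched while the page-specific code is repaired.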

Performance and Implications

In testing, DRIVE demonstrated an average task success rate of 52.8% across five WebArena domains, surpassing the skill-free baseline by 7.3 percentage points. The ablation study reveals that the separation of reasoning and interaction skills not only provides distinct benefits but also enhances overall system performance.

Why does this matter for the future of web automation and AI applications? DRIVE's architecture offers a glimpse into how web agents can become more versatile and effective across different websites, adapting to new environments without losing efficiency. Could this be the catalyst for more autonomous digital agents?

What's Next?

While DRIVE makes significant strides, questions about scalability and long-term adaptability remain. Can it maintain its edge as web environments continue to evolve? As the technology progresses, one thing is clear: the separation of abstract reasoning from concrete interactions is a necessary step forward in creating truly intelligent web agents.
