VenusBench-Mobile: Redefining Mobile GUI Agent Evaluations
VenusBench-Mobile introduces a new benchmark for mobile GUI agents, exposing significant gaps in their real-world readiness. This research reveals critical insights into agent performance, urging a rethink in evaluation methodologies.
Evaluating mobile GUI agents has hit a snag. The traditional benchmarks we've relied upon are app-centric and task-homogeneous. They don't capture the chaotic and diverse nature of real-world mobile usage. Enter VenusBench-Mobile, a novel benchmark that aims to shift the landscape by introducing user-centric and realistic task evaluations.
A New Benchmarking Era
VenusBench-Mobile constructs its evaluation framework on two pillars. First, it redefines what to evaluate by designing tasks driven by user intent. This approach mirrors the varied ways we interact with our devices daily. Second, it revolutionizes how we evaluate by employing a capability-oriented annotation scheme. This allows for a granular analysis of agent behavior, offering insights previously masked by broader evaluations.
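To make the capability-oriented idea concrete, here is a minimal sketch of how per-capability success rates might be aggregated from annotated agent steps. The capability names and record format are illustrative assumptions, not the paper's actual taxonomy or data schema:

```python
from collections import defaultdict

# Hypothetical step records: each annotated with the capability it
# exercises (capability names are illustrative, not the paper's exact taxonomy).
steps = [
    {"capability": "perception", "success": True},
    {"capability": "perception", "success": False},
    {"capability": "memory",     "success": False},
    {"capability": "grounding",  "success": True},
]

def capability_scores(steps):
    """Aggregate per-capability success rates from annotated steps."""
    totals, passed = defaultdict(int), defaultdict(int)
    for s in steps:
        totals[s["capability"]] += 1
        passed[s["capability"]] += int(s["success"])
    return {cap: passed[cap] / totals[cap] for cap in totals}

print(capability_scores(steps))
# → {'perception': 0.5, 'memory': 0.0, 'grounding': 1.0}
```

Scoring per capability rather than per task is what surfaces weaknesses (say, in memory) that an aggregate task-success number would hide.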
Crucially, the extensive evaluation of state-of-the-art agents using VenusBench-Mobile reveals stark performance disparities. Compared to their scores on older benchmarks, agents falter significantly when faced with VenusBench-Mobile's more challenging and realistic tasks. The paper's key contribution: highlighting these gaps and pushing for more robust agent development.
Agent Deficiencies Uncovered
The results from VenusBench-Mobile's evaluations are telling. Current mobile GUI agents show glaring deficiencies in perception and memory. These are areas often glossed over by conventional evaluations but become apparent under the new benchmark's scrutiny. Moreover, the agents' performance drops to near-zero success rates when they encounter variations in environment settings. This brittleness raises an important question: are existing agents genuinely ready for real-world deployment?
The ablation study reveals that even top-tier agents struggle when tested under conditions that mimic genuine user interactions. This isn't just a minor hurdle. It's a significant barrier to achieving reliable, everyday use of mobile GUI agents. The industry must pivot towards addressing these shortcomings if it hopes to realize the full potential of automation in mobile interfaces.
Path Forward
VenusBench-Mobile isn't just a new benchmark. It's a wake-up call. The research indicates that the path to robust, real-world mobile GUI agents is fraught with challenges that need immediate attention. The benchmark provides a much-needed stepping stone, urging developers and researchers to address the glaring performance gaps.
VenusBench-Mobile's insights should catalyze a shift in how we approach the development and evaluation of mobile agents. Code and data are available at https://github.com/inclusionAI/UI-Venus/tree/VenusBench-Mobile. Are the agents up to the task? The answer, as VenusBench-Mobile reveals, is a resounding 'not yet.'