AVGen-Bench: A New Era for Text-to-Audio-Video Evaluation
AVGen-Bench offers a comprehensive framework for evaluating Text-to-Audio-Video generation, highlighting critical gaps in current methodologies.
Text-to-Audio-Video (T2AV) generation is swiftly becoming a fundamental tool in media creation. Yet, the evaluation of such systems has lagged, often missing the mark by assessing audio and video separately or relying on simplistic metrics. Enter AVGen-Bench, a novel benchmark designed to address these shortcomings.
A Comprehensive Benchmark
AVGen-Bench is tailored for T2AV generation, offering high-quality prompts across 11 real-world categories. It's a task-driven benchmark that promises to set a new standard for how we assess these complex systems.
The paper's key contribution is a multi-granular evaluation framework. This framework combines lightweight specialist models with Multimodal Large Language Models (MLLMs) to cover everything from perceptual quality to fine-grained semantic control. This matters because, without such a framework, evaluations risk being shallow and unreflective of real-world use cases.
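To make the multi-granular idea concrete, here is a minimal sketch of how per-dimension scores from specialist models and an MLLM judge might be rolled up into perceptual and semantic group scores. All dimension names and values here are illustrative assumptions, not AVGen-Bench's actual metrics or API.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical per-dimension scores for one generated audio-video sample.
# Dimension names are illustrative, not taken from AVGen-Bench.
@dataclass
class SampleScores:
    # Specialist-model metrics (perceptual quality)
    video_aesthetics: float   # e.g. from a lightweight aesthetics predictor
    audio_quality: float      # e.g. from an audio-quality model
    av_sync: float            # audio-visual synchrony
    # MLLM-judged metrics (semantic control)
    prompt_fidelity: float    # does the content match the text prompt?
    text_rendering: float     # on-screen text correctness
    pitch_control: float      # musical pitch adherence

def aggregate(samples: list[SampleScores]) -> dict[str, float]:
    """Average each dimension across samples, then roll the averages up
    into perceptual vs. semantic group scores."""
    dims = SampleScores.__dataclass_fields__.keys()
    per_dim = {d: mean(getattr(s, d) for s in samples) for d in dims}
    perceptual = mean(per_dim[d] for d in ("video_aesthetics", "audio_quality", "av_sync"))
    semantic = mean(per_dim[d] for d in ("prompt_fidelity", "text_rendering", "pitch_control"))
    return {**per_dim, "perceptual": perceptual, "semantic": semantic}

# Toy data mirroring the paper's headline finding: strong aesthetics, weak semantics.
report = aggregate([
    SampleScores(0.90, 0.85, 0.80, 0.50, 0.30, 0.20),
    SampleScores(0.88, 0.90, 0.82, 0.55, 0.25, 0.15),
])
print(f"perceptual={report['perceptual']:.3f}, semantic={report['semantic']:.3f}")
```

A real harness would plug actual model inference into each dimension; the point of the sketch is only the aggregation shape, where a single leaderboard number would hide exactly the perceptual-versus-semantic gap the benchmark exposes.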
The Unveiled Gaps
Why does this benchmark matter? It uncovers a significant gap between the strong aesthetic quality and the weak semantic reliability of T2AV models. Current systems struggle with persistent issues in text rendering, speech coherence, and physical reasoning, and, notably, show a universal breakdown in musical pitch control. These aren't trivial problems. They're fundamental inconsistencies that could undermine the utility of T2AV systems in practical applications.
Code and data are available at http://aka.ms/avgenbench. This access to resources is essential for fostering reproducibility and allowing others to validate or challenge the findings.
Why It Matters
Why should we care about these gaps? Because the future of media creation depends on smooth integration of audio, video, and text. Imagine a world where content can be generated with the touch of a button, yet the output consistently fails to match the intended semantics. That's the risk if these issues aren't addressed.
In short, AVGen-Bench represents a significant step forward in evaluating T2AV systems. But it's also a wake-up call: the industry needs to step up its game on semantic reliability. The question remains: will developers rise to the challenge?
Key Terms Explained
Benchmark: A standardized test used to measure and compare AI model performance.
Evaluation: The process of measuring how well an AI model performs on its intended task.
Multimodal models: AI models that can understand and generate multiple types of data — text, images, audio, video.
Reasoning: The ability of AI models to draw conclusions, solve problems logically, and work through multi-step challenges.