Vision Language Models Struggle with Action Quality Assessment
Vision Language Models show potential in assessing action quality, but current models barely outperform chance, and their errors reveal systematic biases.
Action Quality Assessment (AQA) could revolutionize fields like physical therapy and sports coaching. However, Vision Language Models (VLMs) have yet to unlock this potential. A recent evaluation highlights significant gaps in their performance across activity domains such as fitness, figure skating, and diving.
Current VLM Performance
The investigation into state-of-the-art VLMs such as Gemini 3.1 Pro, Qwen3-VL, and InternVL3.5 reveals a sobering reality. These models perform only slightly better than random chance on AQA tasks. Approaches such as incorporating skeleton information and grounding instructions offer some improvements, but no technique consistently enhances outcomes. This raises a critical question: Are VLMs truly ready for real-world applications in AQA?
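To make "slightly better than chance" concrete, here is a minimal sketch of how one could check whether a model's accuracy on a binary correct/incorrect task is distinguishable from guessing. It is not the paper's evaluation code. The function names, the 50% chance level, and the toy labels are all assumptions made for illustration.

```python
from math import comb


def accuracy(preds, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(preds) != len(labels) or not labels:
        raise ValueError("preds and labels must be non-empty and equally long")
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def binomial_p_value(n_correct, n_total, chance=0.5):
    """One-sided exact binomial test.

    Returns P(X >= n_correct) for X ~ Binomial(n_total, chance), i.e. how
    likely a pure guesser is to do at least this well.
    """
    return sum(
        comb(n_total, k) * chance**k * (1 - chance) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


# Hypothetical toy data: 1 = "executed correctly", 0 = "executed incorrectly".
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
preds = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]

acc = accuracy(preds, labels)
p = binomial_p_value(sum(a == b for a, b in zip(preds, labels)), len(labels))
print(f"accuracy={acc:.2f}, p-value vs. chance={p:.3f}")
```

On a small sample like this one, an accuracy above 50% can easily arise from guessing, which is why a significance check is useful alongside the raw number.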
Biases and Limitations
The paper's key contribution is the identification of systematic biases. VLMs tend to predict correct execution regardless of the visual evidence and are sensitive to superficial linguistic framing. Attempts to reformulate tasks to mitigate these biases resulted in negligible improvements. This indicates that the models face a fundamental difficulty with assessing fine-grained movement quality.
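The two biases described here can be quantified with simple statistics. The sketch below is an illustration under assumptions, not the paper's method: `bias_report` compares how often a model says "correct" with how often that is true, and `framing_flip_rate` measures how often the answer changes when the same clips are queried with a differently worded prompt. All names and the toy data are hypothetical.

```python
def bias_report(preds, labels):
    """Summarize a tendency to over-predict "correct" (True)."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must be equally long")
    on_correct = [p for p, y in zip(preds, labels) if y]
    on_incorrect = [p for p, y in zip(preds, labels) if not y]
    if not on_correct or not on_incorrect:
        raise ValueError("need at least one example of each class")
    # Recall on each class; balanced accuracy is their mean, so a model
    # that always answers "correct" scores 0.5 regardless of class mix.
    tpr = sum(on_correct) / len(on_correct)
    tnr = sum(not p for p in on_incorrect) / len(on_incorrect)
    return {
        "predicted_correct_rate": sum(preds) / len(preds),
        "actual_correct_rate": sum(labels) / len(labels),
        "balanced_accuracy": (tpr + tnr) / 2,
    }


def framing_flip_rate(preds_a, preds_b):
    """Share of clips whose prediction changes between two prompt wordings."""
    if len(preds_a) != len(preds_b) or not preds_a:
        raise ValueError("prediction lists must be non-empty and equally long")
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Hypothetical toy data: True = "executed correctly".
labels = [True, True, False, False]
preds_prompt_a = [True, True, True, False]
preds_prompt_b = [True, False, True, True]

report = bias_report(preds_prompt_a, labels)
flips = framing_flip_rate(preds_prompt_a, preds_prompt_b)
print(report, flips)
```

A predicted-correct rate well above the actual rate points to the first bias, and a high flip rate between semantically equivalent prompts points to the second.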
Where Do We Go From Here?
So, why should we care? These findings establish a rigorous baseline for future research, but they also serve as a cautionary tale. Before deploying VLMs in real-world AQA scenarios, it is essential to address these biases and limitations. The ablation study shows that isolated gains are achievable, but a universal solution remains elusive.
Does this mean we should abandon VLMs for AQA? Not necessarily. The study provides a detailed outline of the failure modes that need mitigation. It builds on prior work from the NLP and computer vision communities and underscores the need for a more nuanced understanding of movement quality assessment. With further research, VLMs could eventually meet the high standards required for reliable deployment.
Key Terms Explained
Computer vision: The field of AI focused on enabling machines to interpret and understand visual information from images and video.
Evaluation: The process of measuring how well an AI model performs on its intended task.
Gemini: Google's flagship multimodal AI model family, developed by Google DeepMind.
Grounding: Connecting an AI model's outputs to verified, factual information sources.