Vision Language Models Struggle with Action Quality Assessment

By Signe Eriksen, April 10, 2026

Vision Language Models show potential in assessing action quality, but face challenges. Current models barely outperform chance, revealing biases and fundamental issues.

Action Quality Assessment (AQA) could revolutionize fields like physical therapy and sports coaching. However, Vision Language Models (VLMs) have yet to unlock this potential. A recent evaluation highlights significant gaps in their performance across activity domains such as fitness, figure skating, and diving.

Current VLM Performance

An evaluation of state-of-the-art (SOTA) VLMs such as Gemini 3.1 Pro, Qwen3-VL, and InternVL3.5 reveals a sobering reality: these models perform only slightly better than random chance on AQA tasks. Approaches such as incorporating skeleton information and grounding instructions offer some improvements, but no technique consistently enhances outcomes. This raises a critical question: are VLMs truly ready for real-world AQA applications?
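To make "slightly better than random chance" concrete, here is a minimal sketch of how one might check whether an accuracy figure is distinguishable from guessing on a binary correct/incorrect task. The counts are hypothetical placeholders, not results from the study, and the 50% chance level assumes a balanced two-class setup, which the article does not specify.

```python
from math import comb


def binomial_p_value(correct: int, total: int, chance: float = 0.5) -> float:
    """One-sided p-value: probability of at least `correct` hits by pure guessing."""
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical counts for illustration only (NOT taken from the paper).
correct, total = 54, 100
accuracy = correct / total
p_value = binomial_p_value(correct, total)

print(f"accuracy={accuracy:.2f}, p-value vs. chance={p_value:.3f}")
```

A large p-value here would mean that an accuracy a few points above 50% cannot be told apart from guessing at that sample size.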

Biases and Limitations

The paper's key contribution is the identification of systematic biases. VLMs tend to predict correct execution regardless of the visual evidence and are sensitive to superficial linguistic framing. Attempts to reformulate tasks to mitigate these biases resulted in negligible improvements. This indicates that the models face a fundamental difficulty with assessing fine-grained movement quality.
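The two biases described above can be quantified with simple diagnostics. The sketch below is one possible way to do so, not the paper's protocol: the positive-prediction rate and balanced accuracy expose a tendency to answer "correct" regardless of the video, and a flip rate between two prompt wordings exposes sensitivity to framing. All function names and data are hypothetical.

```python
def positive_rate(preds):
    """Fraction of clips the model labels as correctly executed."""
    return sum(preds) / len(preds)


def balanced_accuracy(labels, preds):
    """Mean of recall on correct and on incorrect executions."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    tn = sum(1 for y, p in zip(labels, preds) if not y and not p)
    pos = sum(labels)
    neg = len(labels) - pos
    return 0.5 * (tp / pos + tn / neg)


def flip_rate(preds_a, preds_b):
    """Fraction of clips whose prediction changes between two prompt framings."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


# Hypothetical toy data: True means "executed correctly".
labels = [True, False, True, False, False, True, False, True]
# Predictions when asked "Is the exercise performed correctly?" (assumed wording)
preds_correct_framing = [True, True, True, True, False, True, True, True]
# Predictions when asked "Does the exercise contain an error?", mapped back to
# the same True = correct convention (assumed wording)
preds_error_framing = [True, False, True, True, False, False, True, True]

print("positive rate:", positive_rate(preds_correct_framing))
print("balanced accuracy:", balanced_accuracy(labels, preds_correct_framing))
print("flip rate:", flip_rate(preds_correct_framing, preds_error_framing))
```

A positive rate far above the true share of correct executions, paired with a balanced accuracy near 0.5, is the signature of a model that defaults to "correct" rather than reading the visual evidence.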

Where Do We Go From Here?

So, why should we care? These findings establish a rigorous baseline for future research, but they also serve as a cautionary tale. Before deploying VLMs in real-world AQA scenarios, it is essential to address these biases and limitations. The ablation study shows that isolated gains are achievable, but a universal solution remains elusive.

Does this mean we should abandon VLMs for AQA? Not necessarily. The study provides a detailed outline of failure modes that need mitigation. This builds on prior work from the NLP and computer vision communities, emphasizing the need for a more nuanced understanding of movement quality assessment. With further research, VLMs could eventually meet the high standards required for reliable deployment.
