MLLMs Struggle with Subjective Human Responses in Video Research
Multimodal large language models (MLLMs) like Gemini 3 Flash and Qwen 3 Omni excel at objective tasks, yet they falter when asked to mimic subjective human responses in video engagement assessments.
Multimodal large language models (MLLMs) have been making waves for their prowess in handling objective tasks such as video understanding and reasoning. But when it comes to emulating subjective human responses, these models seem to hit a wall. The question is, can AI truly capture the nuances of human perception?
Objective vs. Subjective Performance
Recent research has put MLLMs to the test, evaluating their ability to act as synthetic participants in assessing perceived sensory engagement with short videos. Grounded in the Perceived Message Sensation Value (PMSV) framework, this study compared human ratings with those of profile-conditioned MLLM simulations. The findings reveal a notable gap in performance. Even leading models like Gemini 3 Flash and Qwen 3 Omni showed limited agreement with human participants.
The models displayed distinct downward mean-shift and central-tendency biases in their rating distributions. This suggests that while they might excel at objective analysis, capturing the subjective experience is a different beast altogether.
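To make the two biases concrete, here is a minimal Python sketch of how they might be quantified for paired ratings of the same videos. It is not the study's analysis code: the function name `rating_bias_summary`, the choice of metrics (difference of means, ratio of standard deviations, Pearson correlation), and the sample ratings are all illustrative assumptions.

```python
from statistics import mean, stdev


def rating_bias_summary(human, model):
    """Compare paired ratings of the same videos from humans and a model.

    Hypothetical helper for illustration; the metrics are one plausible
    way to operationalize the biases described, not the study's method.
    """
    if len(human) != len(model) or len(human) < 2:
        raise ValueError("need two equal-length lists with at least 2 ratings")

    mean_h, mean_m = mean(human), mean(model)

    # Negative value: the model rates lower than humans on average
    # (a downward mean shift).
    mean_shift = mean_m - mean_h

    # Value below 1: model ratings are bunched more tightly than human
    # ratings (central-tendency bias). Assumes human ratings are not all equal.
    spread_ratio = stdev(model) / stdev(human)

    # Pearson correlation as a simple agreement measure.
    # Assumes neither list is constant (otherwise the denominator is zero).
    cov = sum((h - mean_h) * (m - mean_m) for h, m in zip(human, model))
    var_h = sum((h - mean_h) ** 2 for h in human)
    var_m = sum((m - mean_m) ** 2 for m in model)
    pearson_r = cov / (var_h * var_m) ** 0.5

    return {
        "mean_shift": mean_shift,
        "spread_ratio": spread_ratio,
        "pearson_r": pearson_r,
    }


# Made-up ratings on a 1-7 scale, for illustration only.
human_ratings = [2, 6, 4, 7, 1, 5]
model_ratings = [3, 4, 3, 4, 3, 4]
summary = rating_bias_summary(human_ratings, model_ratings)
print(summary)
```

In this toy data the model's ratings sit lower on average and cluster around the middle of the scale, even though they rise and fall with the human ratings. That is why a correlation alone can hide both biases.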
Challenges in Capturing Human Nuance
The study, involving 673 simulations, used a 17-item scale to measure emotional arousal, dramatic impact, and novelty. However, MLLMs have a tendency to both introduce and flatten subgroup differences, which points to their inconsistent sensitivity to participant profiles. It seems these models are still grappling with the complexity of human emotion and context.
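One way to picture "introducing" versus "flattening" a subgroup difference is to compare the gap between two subgroup means in the human data with the same gap in the simulated data. The sketch below does that in Python. The group labels, the ratings, the helper names, and the 0.5-point threshold are hypothetical choices for illustration and do not come from the study.

```python
from statistics import mean


def subgroup_gap(ratings_by_group, group_a, group_b):
    """Difference in mean rating between two subgroups (a minus b)."""
    return mean(ratings_by_group[group_a]) - mean(ratings_by_group[group_b])


def classify_gap(human_gap, model_gap, threshold=0.5):
    """Label how a simulated subgroup gap relates to the human one.

    `threshold` is an arbitrary cutoff (in rating points) for calling a
    gap meaningful; a real analysis would use a statistical test instead.
    """
    human_has_gap = abs(human_gap) >= threshold
    model_has_gap = abs(model_gap) >= threshold

    if model_has_gap and not human_has_gap:
        return "introduced"   # model shows a difference humans do not
    if human_has_gap and not model_has_gap:
        return "flattened"    # model erases a difference humans show
    if human_has_gap and model_has_gap and human_gap * model_gap < 0:
        return "reversed"     # both show a gap, but in opposite directions
    return "consistent"


# Made-up ratings for two hypothetical participant profiles.
human = {"younger": [5, 6, 5], "older": [3, 2, 3]}
model = {"younger": [4, 4, 4], "older": [4, 4, 3]}

human_gap = subgroup_gap(human, "younger", "older")
model_gap = subgroup_gap(model, "younger", "older")
print(classify_gap(human_gap, model_gap))
```

Running the same comparison across several profile attributes would show whether a model is consistently insensitive to profiles or, as the study reports, inconsistent from one attribute to the next.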
Prompting strategies, which initially seemed like a promising solution, only modestly improved certain aspects while worsening others. It's clear there's still much to learn about the intersection of machine analysis and human emotion.
Opportunities for Improvement
Despite these challenges, the potential for MLLMs in video-based research isn't entirely dim. The study highlights opportunities for development in this area. If we can fine-tune these models to better understand and simulate human subjectivity, the implications could be significant.
Understanding human emotion will be key to unlocking new AI capabilities. But for now, the question remains: can MLLMs ever truly bridge the gap between machine inference and human intuition?
Key Terms Explained
Compute: The processing power needed to train and run AI models.
Gemini: Google's flagship multimodal AI model family, developed by Google DeepMind.
Inference: Running a trained model to make predictions on new data.
Multimodal AI: AI models that can understand and generate multiple types of data, such as text, images, audio, and video.