MLLMs Struggle with Subjective Human Responses in Video Research

Multimodal large language models (MLLMs) have been making waves for their prowess in handling objective tasks such as video understanding and reasoning. But when it comes to emulating subjective human responses, these models seem to hit a wall. The question is: can AI truly capture the nuances of human perception?

Objective vs. Subjective Performance

Recent research has put MLLMs to the test, evaluating their ability to act as synthetic participants in assessing perceived sensory engagement with short videos. Grounded in the Perceived Message Sensation Value (PMSV) framework, this study compared human ratings with those of profile-conditioned MLLM simulations. The findings reveal a notable gap in performance. Even leading models like Gemini 3 Flash and Qwen 3 Omni showed limited agreement with human participants.

The models displayed distinct downward mean-shift and central-tendency biases in their rating distributions. This suggests that while they might excel at objective analysis, capturing the subjective experience is a different beast altogether.
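To make the two biases concrete, here is a minimal sketch of how they could be quantified. It is not the study's analysis code: the function name `rating_bias_summary`, the 1-7 rating range, and the sample ratings are all assumptions made for illustration. A negative mean shift indicates that the model rates lower than humans on average, and a spread ratio below 1 indicates that the model's ratings cluster more tightly than the human ones, which is one simple sign of central-tendency bias.

```python
from statistics import mean, pstdev


def rating_bias_summary(human, model):
    """Summarize two distribution-level biases in model ratings.

    mean_shift:   model mean minus human mean (negative = downward shift)
    spread_ratio: model std dev divided by human std dev
                  (below 1 = model ratings are more compressed)
    """
    mean_shift = mean(model) - mean(human)
    spread_ratio = pstdev(model) / pstdev(human)
    return {"mean_shift": mean_shift, "spread_ratio": spread_ratio}


# Illustrative, made-up ratings on an assumed 1-7 scale.
# These are NOT data from the study.
human_ratings = [2, 3, 5, 6, 7, 4, 1, 6]
model_ratings = [3, 3, 4, 4, 4, 3, 3, 4]

summary = rating_bias_summary(human_ratings, model_ratings)
print(summary)
```

With these invented numbers, the model's ratings sit lower on average and occupy a much narrower band than the human ratings, which is the pattern the study describes.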

Challenges in Capturing Human Nuance

The study, involving 673 simulations, used a 17-item scale to measure emotional arousal, dramatic impact, and novelty. The MLLMs tended both to introduce subgroup differences and to flatten them, which points to inconsistent sensitivity to participant profiles. It seems these models are still grappling with the complexity of human emotion and context.
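One way to picture "introducing" versus "flattening" subgroup differences is to compare the gap between two participant groups in the human data with the same gap in the simulated data. The sketch below assumes that reading; the group labels `profile_a` and `profile_b`, the helper `subgroup_gap`, and all ratings are hypothetical, and the study may have measured subgroup effects differently.

```python
from statistics import mean


def subgroup_gap(ratings_by_group, group_a, group_b):
    """Difference in mean rating between two subgroups."""
    return mean(ratings_by_group[group_a]) - mean(ratings_by_group[group_b])


# Hypothetical case 1: humans differ by profile, the model does not (flattening).
human_case1 = {"profile_a": [5, 6, 5, 6], "profile_b": [3, 4, 3, 4]}
model_case1 = {"profile_a": [4, 4, 4, 4], "profile_b": [4, 4, 4, 4]}

# Hypothetical case 2: humans do not differ, the model does (introducing).
human_case2 = {"profile_a": [4, 5, 4, 5], "profile_b": [5, 4, 5, 4]}
model_case2 = {"profile_a": [5, 5, 5, 5], "profile_b": [4, 4, 4, 4]}

flattened = (
    subgroup_gap(human_case1, "profile_a", "profile_b"),
    subgroup_gap(model_case1, "profile_a", "profile_b"),
)
introduced = (
    subgroup_gap(human_case2, "profile_a", "profile_b"),
    subgroup_gap(model_case2, "profile_a", "profile_b"),
)
print("flattened  (human gap, model gap):", flattened)
print("introduced (human gap, model gap):", introduced)
```

In the first case the human gap disappears in the simulation; in the second, the simulation shows a gap the humans never had. A model that does both across different profile attributes would look inconsistently sensitive to profiles in exactly the sense described above.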

Prompting strategies, which initially seemed like a promising solution, only modestly improved certain aspects while worsening others. It's clear there's still much to learn about the intersection of machine analysis and human emotion.

Opportunities for Improvement

Despite these challenges, the potential for MLLMs in video-based research isn't entirely dim. The study highlights opportunities for development in this area. If we can fine-tune these models to better understand and simulate human subjectivity, the implications could be significant.

Understanding human emotion will be key to unlocking new AI capabilities. But for now, the question remains: can MLLMs ever truly bridge the gap between machine inference and human intuition?
