Inconsistencies in AI-Generated Exercise Prescriptions: A Technical Glitch or a Feature?

By Rina Shimizu, April 14, 2026

AI-generated exercise prescriptions show high semantic consistency but vary in quantitative aspects, especially exercise intensity. Why should this matter for clinical deployment?

Large language models (LLMs) like Gemini 2.5 Flash are increasingly being explored for generating personalized exercise prescriptions. But can we rely on them when consistency is key? In a study evaluating the intra-model consistency of these AI-generated prescriptions, the results are both promising and concerning.

High Semantic Consistency

The model's outputs were semantically consistent across the scenarios tested. Mean cosine similarity ranged from 0.879 to 0.939, and clinically constrained cases showed even greater consistency. What the English-language press missed: consistent wording doesn't necessarily translate to reliable advice.
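For readers unfamiliar with the metric, here is a minimal sketch of how a mean pairwise cosine similarity can be computed over repeated outputs. It assumes each prescription has already been turned into a numeric vector by some embedding model; the article does not say which one the study used, and the toy vectors and function names below are purely illustrative.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def mean_pairwise_similarity(vectors):
    """Average cosine similarity over every unordered pair of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy stand-ins for embeddings of three repeated outputs (not study data).
toy_embeddings = [
    [1.0, 0.0, 1.0],
    [1.0, 0.1, 0.9],
    [0.9, 0.0, 1.1],
]
score = mean_pairwise_similarity(toy_embeddings)
print(f"mean pairwise cosine similarity: {score:.3f}")
```

A score near 1 only says the outputs point in roughly the same direction in embedding space. Two prescriptions that differ in a single number, such as the load, can still score very high.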

Quantitative Variability: A Red Flag?

While prescribed exercise frequency was consistent, variability in quantitative components like exercise intensity raises alarms. The paper, published in Japanese, reports that unclassifiable intensity expressions appeared in 10 to 25% of resistance training outputs. This means that while the models might speak the same language, they aren't singing the same tune on the details.
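To make "unclassifiable" concrete, here is a rough sketch of how one might tag intensity expressions and measure the share that fit no category. The categories and regular expressions are hypothetical, chosen only for illustration; the article does not describe the coding scheme the study actually used.

```python
import re

# Hypothetical intensity categories; NOT the study's coding scheme.
INTENSITY_PATTERNS = [
    ("percent_1rm", re.compile(r"\d+\s*%\s*(?:of\s+)?1\s*-?\s*RM", re.IGNORECASE)),
    ("rpe", re.compile(r"\bRPE\s*\d+", re.IGNORECASE)),
    ("rm_load", re.compile(r"\b\d+\s*-?\s*RM\b", re.IGNORECASE)),
]


def classify_intensity(text):
    """Return the first matching category label, or 'unclassifiable'."""
    for label, pattern in INTENSITY_PATTERNS:
        if pattern.search(text):
            return label
    return "unclassifiable"


def unclassifiable_rate(outputs):
    """Fraction of outputs whose intensity matches no known pattern."""
    if not outputs:
        raise ValueError("need at least one output")
    misses = sum(1 for text in outputs if classify_intensity(text) == "unclassifiable")
    return misses / len(outputs)


# Made-up example sentences, not taken from the study.
samples = [
    "Squats at 60-70% of 1RM, 3 sets",
    "Leg press at RPE 6, 2 sets",
    "Bench press with a 10RM load",
    "Use a weight that feels challenging",
]
rate = unclassifiable_rate(samples)
print(f"unclassifiable: {rate:.0%}")
```

The last sample is the kind of phrasing that reads fine to a person but gives a clinician nothing to verify against a guideline.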

Safety: A Consistent But Varying Feature

On safety, every generated prescription included safety-related expressions. Yet the number of safety sentences varied significantly across scenarios. Clinical cases produced more safety expressions than those for healthy adults, a statistically significant difference (H = 86.18, p < 0.001). The benchmark results speak for themselves, but is this variability acceptable?
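An H statistic is what a Kruskal-Wallis rank test reports, though the article does not name the test, so treat that as an assumption. The sketch below computes H with the standard tie correction for groups of safety-sentence counts. The counts are invented toy values, not the study's data, and the p-value step is left out.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1 to N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with tie correction."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction


# Invented safety-sentence counts per output (toy data only).
healthy_adults = [2, 3, 2, 4]
clinical_cases = [5, 6, 5, 7]
h_stat = kruskal_wallis_h([healthy_adults, clinical_cases])
print(f"H = {h_stat:.2f}")
```

A large H says the groups differ in rank distribution. It does not say whether the extra safety sentences in clinical cases are appropriate or merely inconsistent.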

The data shows that reliance on LLMs for exercise prescriptions demands caution. The need for additional structural constraints and expert validation is apparent. Before deploying such technology clinically, the discrepancies in quantitative aspects can't be ignored. After all, would you trust a prescription that can't decide how intense your workout should be?
