VMBench: A Benchmark for Perception-Aligned Video Motion Generation

Xinran Ling1*, Chen Zhu1*, Meiqi Wu1,3*, Hangyu Li1, Xiaokun Feng1,2,
Cundian Yang1, Aiming Hao1, Jiashu Zhu1, Jiahong Wu1†, Xiangxiang Chu1
(* equal contribution, † corresponding author)
1 AMAP, Alibaba Group    2 CRISE, Institute of Automation, Chinese Academy of Sciences   
3 School of Computer Science and Technology, University of Chinese Academy of Sciences   
Teaser.

Overview of VMBench. Our benchmark encompasses six principal categories of motion patterns, with each prompt structured around three core components: subject, place, and action. We propose a novel multi-dimensional video motion evaluation framework comprising five human-centric quality metrics derived from perceptual preferences. Using videos generated by popular T2V models, we conduct systematic human evaluations to validate the effectiveness of our metrics in capturing human perceptual preferences.

Abstract

Video generation has advanced rapidly, and evaluation methods have improved alongside it, yet assessing the motion of generated videos remains a major challenge. Specifically, there are two key issues: 1) current motion metrics do not fully align with human perception; 2) existing motion prompts are limited in diversity. Based on these findings, we introduce VMBench---a comprehensive Video Motion Benchmark that provides perception-aligned motion metrics and features the most diverse set of motion types to date. VMBench has several appealing properties: (1) Perception-Driven Motion Evaluation Metrics: we identify five dimensions based on how humans perceive motion in videos and develop fine-grained evaluation metrics for each, providing deeper insights into models' strengths and weaknesses in motion quality. (2) Meta-Guided Motion Prompt Generation: a structured method that extracts meta-information, generates diverse motion prompts with LLMs, and refines them through human-AI validation, resulting in a multi-level prompt library covering six key dynamic scene dimensions. (3) Human-Aligned Validation Mechanism: we provide human preference annotations to validate our benchmark, with our metrics achieving an average 35.3% improvement in Spearman's correlation over baseline methods. To the best of our knowledge, this is the first time that the motion quality of generated videos has been evaluated from the perspective of alignment with human perception.
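As a concrete illustration of the validation mechanism, the short Python sketch below computes Spearman's rank correlation between an automatic metric's per-video scores and human preference ratings, which is the statistic reported above. The score lists here are hypothetical placeholders, not VMBench data.

# Minimal sketch: correlating an automatic motion metric with human
# preference annotations via Spearman's rank correlation.
# The score lists are hypothetical placeholders, not VMBench data.
from scipy.stats import spearmanr

# Per-video scores from one automatic motion metric.
metric_scores = [0.82, 0.47, 0.91, 0.33, 0.68]
# Mean human preference ratings for the same five videos.
human_ratings = [4.5, 2.8, 4.9, 2.1, 3.7]

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"Spearman's rho = {rho:.3f} (p = {p_value:.3f})")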

Perception-Driven Motion Evaluation Metrics (PMM)

Framework of our Perception-Driven Motion Metrics (PMM). PMM comprises five evaluation metrics: Commonsense Adherence Score (CAS), Motion Smoothness Score (MSS), Object Integrity Score (OIS), Perceptible Amplitude Score (PAS), and Temporal Coherence Score (TCS). (a-e): Computational flowcharts for each metric. The scores produced by PMM show variation trends consistent with human assessments, indicating strong alignment with human perception.
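For a concrete picture of how the five dimensions can feed a single summary number, here is a minimal Python sketch that averages per-dimension scores with equal weights. The equal weighting and the example scores are illustrative assumptions, not the official PMM aggregation.

# Hypothetical sketch: combining the five PMM dimensions into one score.
# Equal weighting is an illustrative assumption, not the official scheme.
from statistics import mean

def pmm_summary(scores: dict[str, float]) -> float:
    """Average the five per-dimension scores (each assumed to lie in [0, 1])."""
    dims = ("CAS", "MSS", "OIS", "PAS", "TCS")
    return mean(scores[d] for d in dims)

video_scores = {"CAS": 0.71, "MSS": 0.88, "OIS": 0.64, "PAS": 0.59, "TCS": 0.80}
print(f"PMM summary: {pmm_summary(video_scores):.3f}")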

Diagram of Human Perception Flow

Our metrics framework for evaluating video motion is inspired by the mechanisms of human motion perception in videos. (a) Human perception of motion in videos primarily encompasses two dimensions: Comprehensive Analysis of Motion and Capture of Motion Details. (b) Our proposed metrics framework: MSS and CAS correspond to Comprehensive Analysis of Motion, while OIS, PAS, and TCS correspond to Capture of Motion Details.

Meta-Guided Motion Prompt Generation (MMPG)

Framework of our Meta-Guided Motion Prompt Generation (MMPG). MMPG consists of three stages: (a) Meta-Information Extraction: extracting Subjects, Places, and Actions from datasets such as VidProM [30], DiDeMo [35], MSR-VTT [34], WebVid [33], Places365 [31], and Kinetics-700 [32]. (b) Self-Refining Prompt Generation: generating and iteratively refining prompts based on the extracted information. (c) Human-LLM Joint Validation: validating the prompts through a collaborative process between human annotators and DeepSeek-R1 to ensure their rationality.
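To make the pipeline concrete, the toy sketch below composes a draft motion prompt from a (subject, place, action) triple, mirroring stages (a) and (b). The word lists and template are invented for illustration; in MMPG the triples are extracted from the cited datasets, and an LLM refines each draft before Human-LLM joint validation.

# Toy sketch: composing a draft motion prompt from meta-information.
# Word lists and the template are hypothetical; MMPG extracts triples
# from the cited datasets and refines drafts with an LLM.
import random

subjects = ["a tourist", "three books", "a golden retriever"]
places = ["an outdoor swimming pool", "a soccer field", "a busy street"]
actions = ["splashes water energetically", "tumbles through the air", "sprints ahead"]

def draft_prompt() -> str:
    subject = random.choice(subjects)
    place = random.choice(places)
    action = random.choice(actions)
    return f"{subject.capitalize()} {action} in {place}."

random.seed(0)
print(draft_prompt())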

VMBench Meta-Guided Motion Prompt Statistics

Statistical analysis of motion prompts in VMBench. (a-h): Multi-perspective statistics of the prompt set. These analyses demonstrate VMBench's comprehensive evaluation scope, encompassing motion dynamics, information diversity, and real-world commonsense adherence.

VMBench Generation Results of Open-Source Models

Prompt: A tourist joyfully splashes water in an outdoor swimming pool, their arms and legs moving energetically as they playfully splash around.

Prompt: Three books are thrown into the air, their pages fluttering as they soar over the soccer field, landing in a scattered pattern.

VMBench Evaluation Results of Video Generative Models

We visualize the evaluation results of the six most recent video generation models across the Perception-Driven Motion Metrics (PMM) dimensions.
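For reference, a radar chart of the kind used for this visualization can be reproduced with a few lines of matplotlib; the model scores below are placeholders rather than VMBench results.

# Minimal radar-chart sketch for per-dimension PMM scores.
# The scores are placeholders, not VMBench results.
import numpy as np
import matplotlib.pyplot as plt

dims = ["CAS", "MSS", "OIS", "PAS", "TCS"]
scores = [0.72, 0.85, 0.61, 0.58, 0.79]  # hypothetical model scores

angles = np.linspace(0, 2 * np.pi, len(dims), endpoint=False).tolist()
angles += angles[:1]            # close the polygon
values = scores + scores[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, values, linewidth=2)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(dims)
ax.set_ylim(0, 1)
plt.title("Per-dimension PMM scores (placeholder data)")
plt.show()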

BibTeX

If you find our work useful, please consider citing our paper:

@misc{ling2025vmbenchbenchmarkperceptionalignedvideo,
      title={VMBench: A Benchmark for Perception-Aligned Video Motion Generation},
      author={Xinran Ling and Chen Zhu and Meiqi Wu and Hangyu Li and Xiaokun Feng and Cundian Yang and Aiming Hao and Jiashu Zhu and Jiahong Wu and Xiangxiang Chu},
      year={2025},
      eprint={2503.10076},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.10076},
}