Taming Hallucinations: Boosting MLLMs’ Video Understanding via Counterfactual Video Generation

Zhe Huang1,3*,†, Hao Wen2,3*,†, Aiming Hao3*, Bingze Song3,
1Tsinghua University    2Beihang University    3AMAP, Alibaba Group   
*Co-first Authors    Internship    Project Leader    Corresponding Author

Abstract

Multimodal Large Language Models (MLLMs) have made remarkable progress in video understanding. However, they suffer from a critical vulnerability: an over-reliance on language priors, which can lead to "visually ungrounded hallucinations", especially when processing counterfactual videos that defy common sense. This limitation stems from the intrinsic data imbalance between text and video and is challenging to address due to the substantial cost of generating and annotating counterfactual data. To address this, we introduce DualityForge, a novel counterfactual data synthesis framework that employs controllable, diffusion-based video editing to transform real-world videos into counterfactual scenarios. By embedding structured contextual information into the video editing and QA generation processes, the framework automatically produces high-quality QA pairs together with original-edited video pairs for contrastive training. Based on this, we build DualityVidQA, a large-scale video dataset designed to reduce MLLM hallucinations. Furthermore, to fully exploit the contrastive nature of our paired data, we propose Duality-Normalized Advantage Training (DNA-Train), a two-stage SFT-RL training regime whose RL phase applies ℓ1 normalization of advantages within each real-counterfactual pair, enabling more stable and efficient policy optimization. Experiments on DualityVidQA-Test demonstrate that our method substantially reduces model hallucinations on counterfactual videos, yielding a relative improvement of 24.0% over the Qwen2.5-VL-7B baseline. Moreover, our approach achieves significant gains on both hallucination and general-purpose benchmarks, indicating strong generalization capability.

Framework Overview

framework

DualityVidQA Dataset. We begin with a web-sourced real-video dataset and apply a framework that integrates MLLMs, grounding and segmentation modules, and image/video editing models to synthesize counterfactual (CF) videos with targeted visual, semantic, and commonsense alterations. Each real-CF video pair is accompanied by MLLM-generated questions produced with carefully designed prompts. The dataset comprises three splits: DualityVidQA-SFT, with real and counterfactual video-QA pairs (54K + 50K) for SFT; DualityVidQA-RL, with 20K shared-question contrastive video-answer pairs (one question posed on both the real and the CF video) for RL; and DualityVidQA-Test (600 pairs), which shares the same contrastive structure as DualityVidQA-RL and covers diverse counterfactual categories.
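For illustration, the sketch below shows what one contrastive DualityVidQA-RL record could look like. The field names, paths, and example question are hypothetical and do not reflect the released schema; they only mirror the "one question, two videos" structure described above.

# Hypothetical layout of one contrastive DualityVidQA-RL record; field names,
# paths, and the example question are illustrative, not the released schema.
record = {
    "question": "What color is the liquid poured into the glass?",
    "real": {
        "video": "videos/real/000123.mp4",      # original web-sourced clip
        "answer": "red",
    },
    "counterfactual": {
        "video": "videos/cf/000123_color.mp4",  # diffusion-edited clip
        "answer": "blue",                       # grounded in the edited visuals
        "edit_type": "color",                   # visual / semantic / common-sense anomaly
    },
}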

Example Video

Edit Pipeline

edit_pipeline

Video Edit Pipeline. Three pipelines are shown: (1) Visual Anomaly: we use OpenCV to edit the video at the pixel level. (2) Semantic Anomaly: we use VACE to edit the video at the object level. (3) Common-Sense Anomaly: we use an MLLM to generate the edit instruction, use FLUX-Kontext to edit the first frame into a new end frame, and finally use VACE to interpolate the video between them.
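As a concrete illustration of the pixel-level branch, the minimal sketch below applies a uniform hue shift to every frame with OpenCV. The specific edit, function, and file names are assumptions for illustration, not the exact operation used in the pipeline.

import cv2

def hue_shift_video(src_path, dst_path, hue_offset=60):
    """Apply a uniform hue shift to every frame: a simple pixel-level visual anomaly."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (width, height))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # OpenCV stores hue in [0, 180); shift it modulo that range
        hsv[..., 0] = (hsv[..., 0].astype(int) + hue_offset) % 180
        writer.write(cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR))
    cap.release()
    writer.release()

# e.g. hue_shift_video("real_clip.mp4", "cf_color_clip.mp4")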

DNA-Train Framework

dna-train

DNA-Train framework. We first perform SFT on our dual dataset to initialize the model. During RL, we sample a group of responses for both the real and the CF video, compute their rewards based on task correctness, and calculate the ℓ1 norm of the intra-group advantages. Finally, we normalize the advantages across the dual groups to ensure balanced gradients.
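To make the normalization step concrete, here is a minimal sketch of per-pair advantage normalization, assuming GRPO-style intra-group advantages (rewards centered within each group). The function name and exact scaling are an illustrative reading of the description above, not the released training code.

import numpy as np

def dna_advantages(real_rewards, cf_rewards, eps=1e-6):
    """Dual-normalized advantages for one real/CF pair of response groups.

    Each argument holds the correctness rewards of a group of sampled
    responses to the shared question on the real or CF video.
    """
    # Intra-group (GRPO-style) advantages: reward minus the group mean
    adv_real = real_rewards - real_rewards.mean()
    adv_cf = cf_rewards - cf_rewards.mean()

    # l1 norm of each group's advantages
    l1_real = np.abs(adv_real).sum() + eps
    l1_cf = np.abs(adv_cf).sum() + eps

    # Rescale so the real and CF groups contribute comparable gradient mass
    return adv_real / l1_real, adv_cf / l1_cf

# Example: 4 sampled responses per video, reward = 1 if the answer is correct
real = np.array([1.0, 0.0, 1.0, 1.0])
cf = np.array([0.0, 0.0, 1.0, 0.0])
adv_real, adv_cf = dna_advantages(real, cf)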

Examples of LLM QA

example

Dataset Statistics

Video Dataset Type Statistics

Type            Count
color           27353
replacement     9961
appearance      6092
disappear       5016
common sense    86746
All             133168

QA Type Frequency Statistics (Overall)

QA Type            Real Video    Counterfactual Video
Multiple Choice    12210         10224
Open-Ended         42669         39776
All                54879         50000

Counterfactual Video Category Statistics

Tag                         Count
causal reversal             158
counter physical            221
object/scene deformation    187
attribute change            33
All                         599

BibTeX

@article{huang2025taming,
  title={Taming Hallucinations: Boosting MLLMs' Video Understanding via Counterfactual Video Generation},
  author={Huang, Zhe and Wen, Hao and Hao, Aiming and Song, Bingze and Wu, Meiqi and Wu, Jiahong and Chu, Xiangxiang and Lu, Sheng and Wang, Haoqian},
  journal={arXiv preprint arXiv:2512.24271},
  year={2025}
}