Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, edited results often lack multi-view consistency, and the extreme scarcity of paired, 3D-consistent editing data makes supervised fine-tuning (SFT) impractical.
In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT.
Specifically, we leverage VGGT's robust priors learned from massive real-world data, feed the edited images into it, and utilize the output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency.
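The reward described above can be sketched in a few lines. Note that `run_vggt` below is a hypothetical stand-in for a VGGT forward pass (the real model's interface differs), and the weighting `lam` is an assumed hyperparameter, not a value taken from the paper; this is a minimal illustration of combining a confidence term with a pose-error penalty, not the authors' implementation.

```python
import numpy as np

def run_vggt(views: np.ndarray):
    """Hypothetical stand-in for a VGGT forward pass.

    Returns a per-pixel confidence map for each view and an
    estimated camera pose (a 3-vector placeholder per view).
    A real system would call the actual VGGT model here.
    """
    n, h, w, _ = views.shape
    rng = np.random.default_rng(0)           # deterministic placeholder outputs
    conf = rng.uniform(0.5, 1.0, size=(n, h, w))
    poses = rng.normal(size=(n, 3))
    return conf, poses

def consistency_reward(edited_views: np.ndarray,
                       ref_poses: np.ndarray,
                       lam: float = 0.1) -> float:
    """Reward = mean VGGT confidence minus a weighted pose error.

    High confidence suggests VGGT found the edited views geometrically
    coherent; a large pose-estimation error relative to the known
    source-view poses signals a 3D-inconsistent edit.
    """
    conf, est_poses = run_vggt(edited_views)
    conf_term = conf.mean()
    pose_err = np.linalg.norm(est_poses - ref_poses, axis=-1).mean()
    return float(conf_term - lam * pose_err)
```

In an RL loop, this scalar would score each batch of edited views, so policy-gradient updates push the 2D editing model toward the 3D-consistent manifold without any paired supervision.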
Note: Due to GitHub file size limits, the video is provided in 720p HEVC format. For the 1080p version, please contact the authors.
@article{wang2026geometry,
  title={Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing},
  author={Wang, Jiyuan and Lin, Chunyu and Sun, Lei and Cao, Zhi and Yin, Yuyang and Nie, Lang and Yuan, Zhenlong and Chu, Xiangxiang and Wei, Yunchao and Liao, Kang and others},
  journal={arXiv preprint arXiv:2603.03143},
  year={2026}
}