From Editor to Dense Geometry Estimator

¹Beijing Jiaotong University   ²Alibaba Group
³Chongqing University of Posts and Telecommunications   ⁴Nanyang Technological University
*Corresponding author   Project leader
Teaser image demonstrating FE2E dense geometry prediction.

We present FE2E, a DiT-based foundation model for monocular dense geometry prediction. Trained with limited supervision, FE2E achieves strong performance improvements in zero-shot depth and normal estimation. Bar length indicates the average ranking across all metrics from multiple datasets (lower is better); ⭐ indicates the amount of training data.

Overview

Leveraging visual priors from pre-trained text-to-image generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than generators, may be a more suitable foundation for fine-tuning.

Our systematic analysis reveals that editing models possess inherent structural priors, which allow them to converge more stably by "refining" their innate features and ultimately achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that is the first to adapt an advanced Diffusion Transformer (DiT)-based editing model for dense geometry prediction. To tailor the editor to deterministic tasks, we reformulate its original flow matching loss into a "consistent velocity" training objective and apply logarithmic quantization to resolve precision conflicts.
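
To make the logarithmic quantization concrete, here is a minimal sketch assuming a simple per-pixel mapping; the depth range, number of levels, and function names are illustrative choices, not the released implementation. Spacing the levels uniformly in log-depth allots finer steps to nearby depths, where quantization error is most visible.

```python
import numpy as np

# Illustrative bounds and bit depth; FE2E's actual settings may differ.
D_MIN, D_MAX, NUM_LEVELS = 0.1, 80.0, 65536

def log_quantize_depth(depth: np.ndarray) -> np.ndarray:
    """Map metric depth to discrete levels spaced uniformly in log-depth."""
    depth = np.clip(depth, D_MIN, D_MAX)
    log_norm = (np.log(depth) - np.log(D_MIN)) / (np.log(D_MAX) - np.log(D_MIN))
    return np.round(log_norm * (NUM_LEVELS - 1)).astype(np.uint16)

def dequantize_depth(levels: np.ndarray) -> np.ndarray:
    """Invert the logarithmic quantization back to metric depth."""
    log_norm = levels.astype(np.float64) / (NUM_LEVELS - 1)
    return np.exp(log_norm * (np.log(D_MAX) - np.log(D_MIN)) + np.log(D_MIN))
```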

Additionally, we leverage the DiT's built-in global attention mechanism to design a cost-free joint estimation strategy, which allows the model to output both depth and normals in a single forward pass. Without scaling up training data, FE2E achieves significant performance improvements in both zero-shot monocular depth and normal estimation.
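
As a rough illustration of the single-pass joint estimation, the sketch below packs the image tokens and a fixed-origin target into one sequence so that the DiT's global attention covers both outputs; `dit`, `vae`, the token layout, and which half of the output is repurposed are all assumptions for illustration, not the actual FE2E interface.

```python
import torch

def joint_forward(dit, vae, image):
    """Illustrative single-pass joint depth + normal prediction (sketch only)."""
    with torch.no_grad():
        z_img = vae.encode(image)                  # image latent, shape (B, N, C)
    z_origin = torch.zeros_like(z_img)             # fixed origin for the target latent
    # One sequence holds both the conditioning tokens and the target tokens;
    # global attention lets every output token see the whole image.
    tokens = torch.cat([z_img, z_origin], dim=1)
    t0 = torch.zeros(z_img.shape[0], device=z_img.device)
    v = dit(tokens, t0)
    # Keep the target half for depth and repurpose the normally discarded half
    # (the conditioning-token positions) for the normal map.
    v_normal, v_depth = v.split(z_img.shape[1], dim=1)
    depth = vae.decode(z_origin + v_depth)         # single integration step, t: 0 -> 1
    normal = vae.decode(z_origin + v_normal)
    return depth, normal
```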

The gallery below presents several examples comparing FE2E with previous state-of-the-art methods. Drag the slider to reveal details on either side.

Gallery

How it works

FE2E Adaptation Pipeline

FE2E adapts a pre-trained, DiT-based image editing model for dense geometry prediction. A pre-trained VAE encodes the logarithmically quantized depth $\mathbf{d}$, the input image $\mathbf{x}$, and the normals $\mathbf{n}$ into latent space. The DiT $f_\theta$ learns a constant velocity $\mathbf{v}$ from a fixed origin $\mathbf{z}^y_0$ to the target latent $\mathbf{z}^y_1$, independent of the timestep $t$ or text instructions. By repurposing the otherwise discarded output region, FE2E jointly predicts depth and normals at no extra computational cost.

FE2E adaptation pipeline
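
Below is a minimal sketch of the constant-velocity training objective described above, assuming the fixed origin $\mathbf{z}^y_0$ is all zeros and a simplified `dit` call signature; because the origin is fixed rather than sampled noise, the regression target $\mathbf{v} = \mathbf{z}^y_1 - \mathbf{z}^y_0$ is identical at every timestep, which is what makes the objective deterministic.

```python
import torch
import torch.nn.functional as F

def consistent_velocity_loss(dit, z_img, z_target):
    """Illustrative loss for the constant-velocity objective (not official code).

    z_img:    latent of the input image, shape (B, N, C)
    z_target: latent of the quantized depth / normal target, shape (B, N, C)
    """
    z_origin = torch.zeros_like(z_target)          # fixed origin z_0^y
    v_target = z_target - z_origin                 # constant velocity toward z_1^y
    t = torch.rand(z_target.shape[0], device=z_target.device)
    # Point on the straight path from origin to target at time t.
    z_t = z_origin + t.view(-1, 1, 1) * v_target
    v_pred = dit(torch.cat([z_img, z_t], dim=1), t)
    v_pred = v_pred[:, z_img.shape[1]:]            # supervise the target-token half
    # The prediction is regressed to the same velocity regardless of t.
    return F.mse_loss(v_pred, v_target)
```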

Quantitative Zero-shot Depth Comparison

Despite being trained on only 71K images, FE2E significantly outperforms recent methods, achieving over 35% performance gains on the ETH3D dataset and surpassing the DepthAnything series, which is trained on 100× more data. This highlights the effectiveness of leveraging editing-model priors rather than simply scaling up training data.

Zero-shot Depth Comparison with other methods

Quantitative Zero-shot Normal Comparison

The best and second-best results are highlighted. ⭐ denotes that the method relies on pre-trained Stable Diffusion.

Zero-shot Normal Comparison with other methods

Refer to the paper (PDF linked above) for further qualitative results, quantitative results, and ablation studies.

Citation