Leveraging visual priors from pre-trained text-to-image generative models has shown success in dense prediction. However, dense prediction is inherently an image-to-image task, suggesting that image editing models, rather than generators, may be a more suitable foundation for fine-tuning.
Our systematic analysis reveals that editing models possess inherent structural priors, which enable them to converge more stably by "refining" their innate features and ultimately to achieve higher performance than their generative counterparts. Based on these findings, we introduce FE2E, a framework that is the first to adapt an advanced editing model built on the Diffusion Transformer (DiT) architecture for dense geometry prediction. To tailor the editor for deterministic tasks, we reformulate the editor's original flow matching loss into a "consistent velocity" training objective and apply logarithmic quantization to resolve precision conflicts (both ideas are sketched below).
Additionally, we leverage the DiT's built-in global attention mechanism to design a cost-free joint estimation strategy, which allows the model to output both depth and normals in a single forward pass. Without scaling up training data, FE2E achieves significant performance improvements in both zero-shot monocular depth and normal estimation.
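To make the reformulation concrete, the sketch below illustrates one possible logarithmic depth quantization and a constant-velocity flow-matching objective. The depth range, tensor layout, and the DiT call signature are assumptions for illustration, not FE2E's exact implementation.

```python
import torch

def log_quantize_depth(depth, d_min=1e-3, d_max=80.0):
    """Map metric depth to [-1, 1] on a logarithmic scale before VAE encoding.
    The depth range here is an assumed example; FE2E's exact quantization
    parameters may differ."""
    log_d = torch.log(depth.clamp(d_min, d_max))
    lo = torch.log(torch.tensor(d_min))
    hi = torch.log(torch.tensor(d_max))
    return 2.0 * (log_d - lo) / (hi - lo) - 1.0

def consistent_velocity_loss(f_theta, z_y0, z_y1, z_x, t):
    """Consistent-velocity objective: along the straight path from the fixed
    origin z_y0 to the target latent z_y1, the model should predict the same
    constant velocity v = z_y1 - z_y0 at every timestep t."""
    v_target = z_y1 - z_y0                        # constant velocity field
    z_t = z_y0 + t.view(-1, 1, 1, 1) * v_target   # point on the straight path
    v_pred = f_theta(z_t, z_x, t)                 # hypothetical DiT signature
    return torch.mean((v_pred - v_target) ** 2)
```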
The gallery below presents several images and comparisons of FE2E with previous state-of-the-art methods. Use the slider and gestures to reveal details on both sides.
FE2E adapts a pre-trained image editing model built on the Diffusion Transformer (DiT) architecture for dense geometry prediction. A pre-trained VAE encodes the logarithmically quantized depth $\mathbf{d}$, the input image $\mathbf{x}$, and the normals $\mathbf{n}$ into the latent space. The DiT $f_\theta$ learns a constant velocity $\mathbf{v}$ from a fixed origin $\mathbf{z}^y_0$ to the target latent $\mathbf{z}^y_1$, independent of the timestep $t$ and of text instructions. By repurposing the otherwise-discarded output region, FE2E jointly predicts depth and normals without extra computation.
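The joint prediction can be pictured as follows: the editing DiT attends over the target tokens and the image-condition tokens in a single sequence, and FE2E reads the depth prediction from the usual output slot while repurposing the condition slot for normals. The token layout and the `dit` interface below are assumptions for illustration, not the exact implementation.

```python
import torch

def joint_depth_normal_forward(dit, z_y_t, z_x, t):
    """One DiT forward pass over the concatenated token sequence.
    Global attention couples both slots, so the condition-region output,
    normally discarded by the editor, can carry the normal prediction
    at no extra cost (assumed token layout)."""
    n_target = z_y_t.shape[1]
    tokens = torch.cat([z_y_t, z_x], dim=1)   # [target tokens | image tokens]
    out = dit(tokens, t)                      # hypothetical sequence-in, sequence-out call
    v_depth = out[:, :n_target]               # standard output region -> depth velocity
    v_normal = out[:, n_target:]              # repurposed region      -> normal velocity
    return v_depth, v_normal
```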
Despite being trained on only 71K images, FE2E significantly outperforms recent methods, achieving over 35% performance gains on the ETH3D dataset and surpassing the DepthAnything series, which is trained on 100× more data. This highlights the effectiveness of leveraging editing-model priors rather than simply scaling up training data.
The best and second-best performances are highlighted. ⭐ denotes that the method relies on pre-trained Stable Diffusion.
Refer to the PDF paper linked above for detailed qualitative, quantitative, and ablation studies.