Advancing End-To-End Pixel-Space Generative Modeling via Self-Supervised Pre-training
Jiachen Lei1
Keli Liu1
Julius Berner2
Haiming Yu1
Hongkai Zheng2
Jiahong Wu1
Xiangxiang Chu1
1AMAP, Alibaba Group
2Caltech
TL;DR: A novel training framework for pixel-space image generation with no reliance on external models (e.g., VAE or DINO): SSL pre-training + end-to-end fine-tuning = SOTA performance on ImageNet-256/512.
Pixel-space generative models are often more difficult to train and generally underperform their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our training framework demonstrates strong empirical performance on the ImageNet dataset. Specifically, our diffusion model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 function evaluations (NFE), surpassing prior pixel-space methods by a large margin in both generation quality and efficiency while rivaling leading VAE-based models at comparable training cost. Furthermore, on ImageNet-256, our consistency model achieves an impressive FID of 8.82 in a single sampling step, significantly surpassing its latent-space counterpart. To the best of our knowledge, this marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models.
Results | Diffusion model on ImageNet-256

Table 1: System-level comparison on ImageNet-256 with CFG. For latent-space models, we report the parameters and sampling GFLOPs of both the VAE and the generative model. GFLOPs for our EPG are computed following DiT. †: model parameters and GFLOPs comprise two generative models and one classifier. Text in gray: methods that require external models in addition to the VAE.

Results | Consistency model on ImageNet-256

Table 2: Benchmarking few-step class-conditional image generation on ImageNet-256.

Results | Diffusion model on ImageNet-512

Table 3: System-level comparison on ImageNet-512 with CFG. Evaluation settings are the same as Table 1.

Methodology

Figure 2: Overview of our method. (Left) Our pre-training framework. c is a learnable [CLS] token; t_0, t_n, t_{n−1} are time conditions; (y_1, y_2) are augmented views of the clean image x_{t_0}; (x_{t_n}, x_{t_{n−1}}) are temporally adjacent points from the same PF ODE trajectory with x_{t_0} as the initial point; θ⁻ is the exponential moving average of θ, and sg is the stop-gradient operation. (Right) Fine-tuning stage. After pre-training, we discard the projector and train E_θ alongside a randomly initialized decoder D_θ on downstream generation tasks.
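
For concreteness, the "temporally adjacent points from the same PF ODE trajectory" can be approximated by reusing a single Gaussian noise sample at two adjacent noise levels, as in consistency tuning. The sketch below assumes an EDM-style corruption x_t = x_0 + t·ε; it is an illustration, not the authors' released code.

import torch

def adjacent_trajectory_points(x0, t_n, t_prev):
    """Construct (x_{t_n}, x_{t_{n-1}}): two temporally adjacent points on the
    same PF ODE trajectory, approximated by adding one shared noise sample at
    two adjacent noise levels (EDM-style x_t = x_0 + t * eps is assumed here).

    x0:      clean images, shape (B, C, H, W)
    t_n:     larger noise level, shape (B, 1, 1, 1)
    t_prev:  smaller adjacent noise level, shape (B, 1, 1, 1), t_prev < t_n
    """
    eps = torch.randn_like(x0)       # noise shared by both points
    x_tn = x0 + t_n * eps            # noisier point x_{t_n}
    x_tprev = x0 + t_prev * eps      # cleaner point x_{t_{n-1}}
    return x_tn, x_tprev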

Prior works have explored training diffusion models directly on raw pixels by developing specialized model architectures or training techniques. Despite these efforts, none have matched the training performance and inference efficiency of VAE-based methods, primarily because of the two major challenges of operating in pixel space: the high computational cost of the model backbone and slow convergence.

In this work, we address this by decomposing the training paradigm into two separate stages, as in SSL: (1) Pre-training: we pre-train encoders to capture visual semantics from clean images while aligning them with corresponding points on the same deterministic sampling trajectory, which evolves points from the pure Gaussian prior to the data distribution. In practice, this is realized by matching images that share the same noise but differ in noise level, as in consistency tuning. This pre-training approach reformulates representation learning on noisy images as a generative alignment task, connecting features of noisy samples to their progressively cleaner versions. (2) Fine-tuning: we fine-tune the pre-trained encoder alongside a randomly initialized decoder in an end-to-end manner under task-specific configurations. A simplified sketch of both stages is given below.
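
The following sketch illustrates the two stages under simplifying assumptions: the module names (encoder, ema_encoder, projector, decoder), the cosine-similarity alignment loss, and the plain denoising objective are illustrative stand-ins rather than the paper's exact formulation, and the clean-view branch (y_1, y_2) and the [CLS] token are omitted.

import torch
import torch.nn.functional as F

def pretrain_step(encoder, ema_encoder, projector, x_tn, x_tprev, t_n, t_prev):
    """Stage 1 (sketch): align features of the noisier point x_{t_n}, computed by
    the online encoder, with features of the cleaner adjacent point x_{t_{n-1}},
    computed by the EMA target encoder under stop-gradient."""
    z_online = projector(encoder(x_tn, t_n))                 # online branch
    with torch.no_grad():                                    # sg(.) on the target branch
        z_target = projector(ema_encoder(x_tprev, t_prev))   # EMA target encoder
    # negative cosine similarity as a stand-in alignment loss
    return -F.cosine_similarity(z_online, z_target, dim=-1).mean()

def finetune_step(encoder, decoder, x0, t):
    """Stage 2 (sketch): end-to-end fine-tuning of the pre-trained encoder with a
    randomly initialized decoder under a plain denoising objective
    (preconditioning and loss weighting omitted)."""
    eps = torch.randn_like(x0)
    x_t = x0 + t * eps                                       # assumed EDM-style corruption
    x0_pred = decoder(encoder(x_t, t), t)                    # predict the clean image
    return F.mse_loss(x0_pred, x0)

@torch.no_grad()
def update_ema(encoder, ema_encoder, decay=0.999):
    """EMA update of the target encoder used during pre-training."""
    for p, p_ema in zip(encoder.parameters(), ema_encoder.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1 - decay)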

Citation

If you find our work interesting, please consider citing our paper:

@misc{lei2025advancingendtoendpixelspace,
  title={Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training},
  author={Jiachen Lei and Keli Liu and Julius Berner and Haiming Yu and Hongkai Zheng and Jiahong Wu and Xiangxiang Chu},
  year={2025},
  eprint={2510.12586},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.12586},
}

Template adapted from Trellis, designed by Jianfeng Xiang.