Diffusion2GAN can learn the noise-to-image mapping of a target diffusion model

One-step image synthesis using Diffusion2GAN

Diffusion2GAN can generate a 512px/1024px image at an interactive speed of 0.09/0.16 seconds. By learning the direct mapping from Gaussian noise to the corresponding images, Diffusion2GAN enables one-step image synthesis.
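
To make the one-step mapping concrete, below is a minimal inference sketch, assuming a trained Diffusion2GAN generator G that maps a (noise, text-embedding) pair to an SD latent and a latent decoder decode; the names, shapes, and interfaces are illustrative, not the released code.

    import torch

    # Illustrative shapes for a 512px SD-style model: 4x64x64 latents, 77x768 text embeddings.
    # `G` (one-step generator) and `decode` (SD latent decoder) are assumed trained modules.
    @torch.no_grad()
    def sample_one_step(G, decode, text_emb, batch=1, device="cuda"):
        z = torch.randn(batch, 4, 64, 64, device=device)  # Gaussian noise with the latent's shape
        latent = G(z, text_emb)                            # single forward pass: noise -> latent
        return decode(latent)                              # decoding happens only at inference time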

Two-step 4K image generation using Diffusion2GAN and GigaGAN

The images generated by Diffusion2GAN can be seamlessly upsampled to 4K resolution using the GigaGAN upsampler. This means we can generate low-resolution preview images with Diffusion2GAN and then enhance preferred images to 4K resolution with the GigaGAN upsampler.
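
The preview-then-upsample workflow could look like the following sketch, under the same assumptions as above plus an upsampler module standing in for the text-conditioned GigaGAN upsampler; its interface here is assumed, not the actual API.

    import torch

    @torch.no_grad()
    def preview_then_upsample(G, decode, upsampler, text_emb, n_previews=4, device="cuda"):
        # Stage 1: cheap one-step previews from Diffusion2GAN (text_emb assumed shape 1x77x768).
        z = torch.randn(n_previews, 4, 64, 64, device=device)
        previews = decode(G(z, text_emb.expand(n_previews, -1, -1)))  # n_previews low-res images

        # Stage 2: a user picks a preferred preview; only that one is enhanced to 4K.
        chosen = previews[0:1]
        return upsampler(chosen, text_emb)  # assumed GigaGAN-style upsampler interface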

Abstract

We propose a method to distill a complex multistep diffusion model into a single-step conditional GAN student model, dramatically accelerating inference while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs of the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating directly in the latent space of the diffusion model, utilizing an ensemble of augmentations. Even accounting for dataset construction costs, E-LatentLPIPS converges more efficiently than many existing distillation methods. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss to build an effective conditional GAN-based formulation. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models, DMD, SDXL-Turbo, and SDXL-Lightning, on the zero-shot COCO benchmark.

Diffusion2GAN Framework

Diffusion2GAN one-step generator

We collect the diffusion model's output latents along with the input noises and prompts. The generator is then trained to map each noise-prompt pair to the corresponding target latent using our proposed E-LatentLPIPS regression loss and a GAN loss. While the generator's output can be decoded into RGB pixels by the SD latent decoder, this compute-intensive operation is never performed during training.
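
A condensed sketch of one training step under stated assumptions: batches of (noise, prompt embedding, target latent) triples precomputed from the teacher's ODE trajectory, an e_latent_lpips loss module, and a latent-space conditional discriminator D. The non-saturating GAN loss and the weight lambda_gan are illustrative, and the discriminator's text-alignment term is omitted for brevity.

    import torch
    import torch.nn.functional as F

    def training_step(G, D, e_latent_lpips, batch, g_opt, d_opt, lambda_gan=0.5):
        z, text_emb, target_latent = batch  # noise, prompt embedding, teacher ODE output latent

        # --- Discriminator update: real = teacher latent, fake = student latent ---
        with torch.no_grad():
            fake = G(z, text_emb)
        d_loss = (F.softplus(-D(target_latent, text_emb)).mean()
                  + F.softplus(D(fake, text_emb)).mean())
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

        # --- Generator update: E-LatentLPIPS regression to the paired target + GAN loss ---
        fake = G(z, text_emb)
        g_loss = (e_latent_lpips(fake, target_latent)
                  + lambda_gan * F.softplus(-D(fake, text_emb)).mean())
        g_opt.zero_grad()
        g_loss.backward()
        g_opt.step()
        return d_loss.item(), g_loss.item()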


E-LatentLPIPS for latent space distillation

Training a single iteration with LPIPS takes 117 ms and 15.0 GB of extra memory on an NVIDIA A100, whereas our E-LatentLPIPS requires 12.1 ms and 0.6 GB on the same device. E-LatentLPIPS thus accelerates perceptual loss computation by 9.7× compared to LPIPS while simultaneously reducing memory consumption.
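
The core idea can be sketched as follows: draw one random augmentation per call, apply it identically to both latents, and evaluate a latent-space LPIPS distance on the augmented pair; averaging these stochastic draws over training iterations approximates the full ensemble, as in E-LPIPS. Here latent_lpips stands in for the calibrated latent-space LPIPS network, and the flip/shift augmentations are illustrative rather than the exact ensemble used in the paper.

    import torch

    def e_latent_lpips(latent_lpips, x, y, max_shift=8):
        """One stochastic sample of the augmentation ensemble, applied identically to both latents."""
        # Random horizontal flip, shared between x and y.
        if torch.rand(()) < 0.5:
            x, y = x.flip(-1), y.flip(-1)

        # Random circular shift as a stand-in for translation, also shared.
        dx, dy = torch.randint(-max_shift, max_shift + 1, (2,)).tolist()
        x = torch.roll(x, shifts=(dy, dx), dims=(-2, -1))
        y = torch.roll(y, shifts=(dy, dx), dims=(-2, -1))

        # The distance is computed entirely in latent space; no decoding to pixels.
        return latent_lpips(x, y)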


Diffusion2GAN multi-scale conditional discriminator

We reuse the pre-trained weights of the teacher model and augment the network with multi-scale input and output branches. Concretely, we feed resized versions of the input latent to each downsampling block of the encoder. In the decoder, we enforce real/fake predictions at three places per scale: before, at, and after the skip connection. This multi-scale adversarial training further improves image quality.
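
Below is a toy sketch of the two architectural ingredients described above: multi-scale latent inputs on the encoder side and three prediction heads per decoder scale. The channel widths and plain convolutional blocks are placeholders; the actual discriminator reuses the teacher U-Net weights and is additionally conditioned on the text embedding.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiScaleLatentDiscriminator(nn.Module):
        """Toy encoder-decoder discriminator over SD latents: every encoder block also sees a
        resized copy of the input latent, and every decoder scale emits real/fake logits
        before, at, and after the skip connection."""

        def __init__(self, in_ch=4, widths=(64, 128, 256)):
            super().__init__()
            # Encoder: each block consumes the previous features concatenated with the resized latent.
            self.enc = nn.ModuleList(
                nn.Sequential(nn.Conv2d(c + in_ch, w, 3, stride=2, padding=1), nn.SiLU())
                for c, w in zip((0,) + widths[:-1], widths)
            )
            # Decoder scales pair deep features with progressively shallower skip features.
            pairs = list(zip(widths[::-1][:-1], widths[::-1][1:]))  # e.g. [(256, 128), (128, 64)]
            self.dec = nn.ModuleList(
                nn.Sequential(nn.Conv2d(w + w_skip, w_skip, 3, padding=1), nn.SiLU())
                for w, w_skip in pairs
            )
            self.pre_heads = nn.ModuleList(nn.Conv2d(w, 1, 1) for w, w_skip in pairs)
            self.at_heads = nn.ModuleList(nn.Conv2d(w + w_skip, 1, 1) for w, w_skip in pairs)
            self.post_heads = nn.ModuleList(nn.Conv2d(w_skip, 1, 1) for w, w_skip in pairs)

        def forward(self, latent):
            feats, x = [], latent.new_zeros(latent.size(0), 0, *latent.shape[2:])
            for block in self.enc:
                resized = F.interpolate(latent, size=x.shape[2:], mode="bilinear", align_corners=False)
                x = block(torch.cat([x, resized], dim=1))  # multi-scale input injection
                feats.append(x)
            logits, x = [], feats[-1]
            for block, skip, h_pre, h_at, h_post in zip(
                self.dec, feats[:-1][::-1], self.pre_heads, self.at_heads, self.post_heads
            ):
                x = F.interpolate(x, size=skip.shape[2:], mode="nearest")
                logits.append(h_pre(x))          # prediction before the skip connection
                x = torch.cat([x, skip], dim=1)
                logits.append(h_at(x))           # prediction at the skip connection
                x = block(x)
                logits.append(h_post(x))         # prediction after the skip connection
            return logits                        # list of per-scale logit maps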


Experimental Results

Single image reconstruction using LPIPS variants

We conduct an image reconstruction experiment by directly optimizing a single latent with different loss functions. Reconstruction with LPIPS roughly reproduces the target image, but at the cost of decoding the latent into pixels. LatentLPIPS alone cannot precisely reconstruct the image, whereas our ensembled augmentation, E-LatentLPIPS, reconstructs the target more precisely while operating directly in the latent space.
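
The reconstruction probe itself is a direct optimization of a single latent, sketched below; loss_fn stands in for LatentLPIPS or E-LatentLPIPS (for pixel-space LPIPS, the latent would additionally be decoded inside the loss), and the step count and learning rate are illustrative.

    import torch

    def reconstruct_latent(loss_fn, target_latent, steps=1000, lr=0.1):
        """Optimize a single latent so that loss_fn(latent, target_latent) is minimized."""
        latent = torch.randn_like(target_latent, requires_grad=True)
        opt = torch.optim.Adam([latent], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = loss_fn(latent, target_latent)  # e.g. LatentLPIPS or E-LatentLPIPS
            loss.backward()
            opt.step()
        return latent.detach()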

Comparison with cutting-edge distillation models

We compare Diffusion2GAN with two concurrent one-step diffusion distillation generators, SDXL-Turbo and SDXL-Lightning, as well as their teacher model, SDXL-Base-1.0. Although SDXL-Lightning produces more diverse images than Diffusion2GAN and SDXL-Turbo, it does so at the expense of FID and CLIP score. Diffusion2GAN, in contrast, achieves better FID and CLIP score than both SDXL-Turbo and SDXL-Lightning, while generating more diverse images than SDXL-Turbo.


COCO2014 benchmark (Stable Diffusion 1.5)

COCO2017 benchmark (SDXL-Base-1.0)

Human preference study

We conduct a human preference study. SD1.5-Diffusion2GAN produces more realistic images with better text-to-image alignment than InstaFlow-0.9B. For SDXL distillation, SDXL-Diffusion2GAN shows superior realism and alignment compared to SDXL-Turbo and SDXL-Lightning. However, in contrast to the COCO2014 and COCO2017 benchmark results, human raters still prefer the teacher diffusion models. We leave developing a better evaluation metric for future work.


Diverse Image Generation using Diffusion2GAN

Diversity and text-to-image alignment of images generated by Diffusion2GAN. Diffusion2GAN generates more diverse images than SDXL-Turbo while exhibiting better text-to-image alignment than SDXL-Lightning.


Related Works

  • Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  • Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In International Conference on Learning Representations (ICLR), 2024.

  • Markus Kettunen, Erik Härkönen, and Jaakko Lehtinen. E-LPIPS: Robust Perceptual Image Similarity via Random Transformation Ensembles. arXiv preprint arXiv:1906.03973, 2019.

  • Tianwei Yin, Michael Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, and Taesung Park. One-step Diffusion with Distribution Matching Distillation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  • Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial Diffusion Distillation. arXiv preprint arXiv:2311.17042, 2023.

  • Shanchuan Lin, Anran Wang, and Xiao Yang. SDXL-Lightning: Progressive Adversarial Diffusion Distillation. arXiv preprint arXiv:2402.13929, 2024.

Acknowledgements

We would like to thank Tianwei Yin, Seungwook Kim, and Sungyeon Kim for their valuable feedback and comments. Part of this work was done while Minguk Kang was an intern at Adobe Research. Minguk Kang and Suha Kwak were supported by the NRF grant and the IITP grant funded by the Ministry of Science and ICT, Korea (NRF-2021R1A2C3012728, AI Graduate School (POSTECH): IITP-2019-0-01906). Jaesik Park was supported by the IITP grant funded by the government of South Korea (MSIT) (AI Graduate School (SNU): 2021-0-01343 and AI Innovation Hub: 2021-0-02068). Jun-Yan Zhu was supported by the Packard Fellowship.

BibTeX

    @inproceedings{kang2024diffusion2gan,
      author    = {Kang, Minguk and Zhang, Richard and Barnes, Connelly and Paris, Sylvain and Kwak, Suha and Park, Jaesik and Shechtman, Eli and Zhu, Jun-Yan and Park, Taesung},
      title     = {{Distilling Diffusion Models into Conditional GANs}},
      booktitle = {European Conference on Computer Vision (ECCV)},
      year      = {2024},
    }