Diffusion2GAN can generate a 512px or 1024px image at an interactive speed of 0.09 or 0.16 seconds, respectively. By learning a direct mapping from Gaussian noise to the corresponding image, Diffusion2GAN enables one-step image synthesis.
Images generated by Diffusion2GAN can be seamlessly upsampled to 4K resolution using the GigaGAN upsampler. This means we can generate low-resolution preview images with Diffusion2GAN and then enhance selected images to 4K resolution with the GigaGAN upsampler.
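A minimal sketch of this preview-then-upsample workflow is shown below. The handles `diffusion2gan`, `vae_decoder`, and `gigagan_upsampler` are hypothetical placeholders, not the released interfaces.

```python
# Hypothetical handles; the released interfaces may differ.
import torch

def preview_and_upsample(diffusion2gan, vae_decoder, gigagan_upsampler, prompt_emb,
                         latent_shape=(1, 4, 64, 64), device="cuda"):
    noise = torch.randn(latent_shape, device=device)
    with torch.no_grad():
        latent = diffusion2gan(noise, prompt_emb)   # single forward pass (one-step synthesis)
        preview = vae_decoder(latent)               # low-resolution RGB preview
        final = gigagan_upsampler(preview)          # enhance a selected preview to 4K
    return preview, final
```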
We propose a method to distill a complex multi-step diffusion model into a single-step conditional GAN student model, dramatically accelerating inference while preserving image quality. Our approach interprets diffusion distillation as a paired image-to-image translation task, using noise-to-image pairs from the diffusion model's ODE trajectory. For efficient regression loss computation, we propose E-LatentLPIPS, a perceptual loss operating in the latent space of the diffusion model with an ensemble of augmentations. E-LatentLPIPS converges more efficiently than many existing distillation methods, even accounting for the dataset construction cost. Furthermore, we adapt a diffusion model to construct a multi-scale discriminator with a text alignment loss, forming an effective conditional GAN-based formulation. We demonstrate that our one-step generator outperforms cutting-edge one-step diffusion distillation models, DMD, SDXL-Turbo, and SDXL-Lightning, on the zero-shot COCO benchmark.
We collect the diffusion model's output latents along with the input noises and prompts. The generator is then trained to map noise and prompt to the target latent using our proposed E-LatentLPIPS regression loss and a GAN loss. While the generator's output can be decoded into RGB pixels by the SD latent decoder, this is a compute-intensive operation that is never performed during training.
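The sketch below illustrates this paired setup under stated assumptions: `teacher_ode_solve`, `OneStepGenerator`, `e_latent_lpips`, and `discriminator` are placeholder names, and the loss weighting is illustrative rather than the paper's exact configuration.

```python
# Minimal sketch of paired ODE distillation: collect (noise, prompt, latent)
# triples from the frozen teacher, then regress the one-step generator onto them.
import torch

def collect_pair(teacher_ode_solve, prompt_emb, latent_shape, device="cuda"):
    """Run the frozen teacher's deterministic ODE sampler once to get a (noise, latent) pair."""
    noise = torch.randn(latent_shape, device=device)
    with torch.no_grad():
        target_latent = teacher_ode_solve(noise, prompt_emb)
    return noise, target_latent

def generator_step(generator, discriminator, e_latent_lpips, batch, lambda_gan=0.5):
    """One generator update combining the regression loss and a GAN loss."""
    noise, prompt_emb, target_latent = batch
    fake_latent = generator(noise, prompt_emb)               # single forward pass
    loss_reg = e_latent_lpips(fake_latent, target_latent)    # perceptual loss in latent space
    loss_gan = -discriminator(fake_latent, prompt_emb).mean()  # non-saturating G loss
    return loss_reg + lambda_gan * loss_gan
```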
A single training iteration with LPIPS takes 117 ms and 15.0 GB of extra memory on an NVIDIA A100, whereas our E-LatentLPIPS requires 12.1 ms and 0.6 GB on the same device. Consequently, E-LatentLPIPS accelerates the perceptual loss computation by 9.7× compared to LPIPS while simultaneously reducing memory consumption.
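A hedged sketch of the idea behind E-LatentLPIPS: apply the same randomly sampled augmentation to both latents, then measure a learned perceptual distance directly in latent space. Here `latent_lpips` stands in for an LPIPS-style network retrained on SD latents, and the specific augmentations (flip, integer translation) are illustrative choices, not the paper's exact ensemble.

```python
import torch

def random_shared_augment(a, b):
    """Apply one randomly sampled augmentation identically to both latents."""
    if torch.rand(()) < 0.5:                                  # random horizontal flip
        a, b = a.flip(-1), b.flip(-1)
    shift = tuple(torch.randint(-4, 5, (2,)).tolist())        # random integer translation
    a = torch.roll(a, shifts=shift, dims=(-2, -1))
    b = torch.roll(b, shifts=shift, dims=(-2, -1))
    return a, b

def e_latent_lpips(latent_lpips, pred_latent, target_latent):
    """One stochastic sample of the augmentation ensemble per call."""
    x, y = random_shared_augment(pred_latent, target_latent)
    return latent_lpips(x, y).mean()
```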
We reuse the pre-trained weights from the teacher model and augment them with multi-scale input and output branches. Concretely, we feed resized versions of the input latent to each downsampling block of the encoder. In the decoder, the discriminator makes real/fake predictions at three places per scale: before, at, and after the skip connection. This multi-scale adversarial training further improves image quality.
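The skeleton below is a schematic sketch of this multi-scale design, not the paper's exact architecture: text conditioning is omitted, the three predictions per scale are simplified to a single head per decoder level, and channel sizes are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDiscriminator(nn.Module):
    """Toy encoder-decoder discriminator with multi-scale inputs and outputs."""
    def __init__(self, ch=64, levels=3, in_ch=4):
        super().__init__()
        self.enc = nn.ModuleList(
            [nn.Conv2d(in_ch + (ch if i else 0), ch, 3, stride=2, padding=1) for i in range(levels)]
        )
        self.dec = nn.ModuleList(
            [nn.Conv2d(ch * (2 if i else 1), ch, 3, padding=1) for i in range(levels)]
        )
        self.heads = nn.ModuleList([nn.Conv2d(ch, 1, 1) for _ in range(levels)])

    def forward(self, latent):
        feats, h = [], None
        for i, conv in enumerate(self.enc):
            # Resized copy of the input latent feeds each downsampling block.
            resized = latent if i == 0 else F.interpolate(latent, scale_factor=0.5 ** i)
            h = conv(resized if h is None else torch.cat([resized, h], dim=1))
            feats.append(h)
        logits, h = [], None
        for i, (conv, head) in enumerate(zip(self.dec, self.heads)):
            skip = feats[-(i + 1)]
            h = conv(skip if h is None else torch.cat([F.interpolate(h, size=skip.shape[-2:]), skip], dim=1))
            logits.append(head(h))  # real/fake prediction at this scale
        return logits
```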
We conduct an image reconstruction experiment by directly optimizing a single latent with different loss functions. Reconstruction with LPIPS roughly reproduces the target image, but at the cost of decoding into pixels. LatentLPIPS alone cannot precisely reconstruct the image. With our ensemble of augmentations, however, E-LatentLPIPS reconstructs the target more precisely while operating directly in the latent space.
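A minimal sketch of this reconstruction probe, assuming `loss_fn` is any of the losses above (the optimizer and hyperparameters are illustrative assumptions):

```python
import torch

def reconstruct_latent(loss_fn, target_latent, steps=1000, lr=0.1):
    """Directly optimize a single latent to match a target under the given loss."""
    latent = torch.randn_like(target_latent, requires_grad=True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(latent, target_latent)  # augmentations resampled at each step
        loss.backward()
        opt.step()
    return latent.detach()
```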
We compare Diffusion2GAN with two concurrent one-step diffusion distillation generators, SDXL-Turbo and SDXL-Lightning, as well as their teacher model, SDXL-Base-1.0. Although SDXL-Lightning produces more diverse images than Diffusion2GAN and SDXL-Turbo, it does so at the expense of FID and CLIP-score. Diffusion2GAN, on the other hand, achieves better FID and CLIP-score than SDXL-Turbo and SDXL-Lightning while generating more diverse images than SDXL-Turbo.
We conduct a human preference study. SD1.5-Diffusion2GAN produces more realistic images with better text-to-image alignment than InstaFlow-0.9B. For SDXL distillation, SDXL-Diffusion2GAN shows superior realism and alignment compared to SDXL-Turbo and SDXL-Lightning. However, unlike the automated results on COCO2014 and COCO2017, human preference favors the teacher diffusion models. We leave developing a better evaluation metric for future work.
Diversity and text-to-image alignment of generated images from Diffusion2GAN. Diffusion2GAN can generate more diverse images than SDXL-Turbo while exhibiting better text-to-image alignment than SDXL-Lightning.
We would like to thank Tianwei Yin, Seungwook Kim, and Sungyeon Kim for their valuable feedback and comments. Part of this work was done while Minguk Kang was an intern at Adobe Research. Minguk Kang and Suha Kwak were supported by the NRF grant and IITP grant funded by the Ministry of Science and ICT, Korea (NRF-2021R1A2C3012728, AI Graduate School (POSTECH): IITP-2019-0-01906). Jaesik Park was supported by the IITP grant funded by the government of South Korea (MSIT) (AI Graduate School (SNU): 2021-0-01343 and AI Innovation Hub: 2021-0-02068). Jun-Yan Zhu was supported by the Packard Fellowship.
@inproceedings{kang2024diffusion2gan,
author = {Kang, Minguk and Zhang, Richard and Barnes, Connelly and Paris, Sylvain and Kwak, Suha and Park, Jaesik and Shechtman, Eli and Zhu, Jun-Yan and Park, Taesung},
title = {{Distilling Diffusion Models into Conditional GANs}},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2024},
}