Paper Title
On Distillation of Guided Diffusion Models
Paper Authors
Paper Abstract
Classifier-free guided diffusion models have recently been shown to be highly effective at high-resolution image generation, and they have been widely used in large-scale diffusion frameworks including DALL·E 2, Stable Diffusion and Imagen. However, a downside of classifier-free guided diffusion models is that they are computationally expensive at inference time, since they require evaluating two diffusion models, a class-conditional model and an unconditional model, tens to hundreds of times. To deal with this limitation, we propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from: given a pre-trained classifier-free guided model, we first learn a single model to match the output of the combined conditional and unconditional models, and then we progressively distill that model into a diffusion model that requires far fewer sampling steps. For standard diffusion models trained on pixel space, our approach is able to generate images visually comparable to those of the original model using as few as 4 sampling steps on ImageNet 64x64 and CIFAR-10, achieving FID/IS scores comparable to those of the original model while being up to 256 times faster to sample from. For diffusion models trained on latent space (e.g., Stable Diffusion), our approach is able to generate high-fidelity images using as few as 1 to 4 denoising steps, accelerating inference by at least 10-fold compared to existing methods on ImageNet 256x256 and LAION datasets. We further demonstrate the effectiveness of our approach on text-guided image editing and inpainting, where our distilled model is able to generate high-quality results using as few as 2-4 denoising steps.
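The abstract describes a two-stage recipe: first distill the classifier-free-guided pair (one conditional plus one unconditional evaluation per step) into a single guidance-conditioned student, then progressively reduce the number of sampling steps of that student. Below is a minimal sketch of both stages, assuming a PyTorch-style interface; `teacher`, `student`, and `ddim_step` are hypothetical placeholders, and the paper's exact parameterization and loss weighting differ from this simplified version.

```python
# Minimal sketch (not the authors' code) of the two-stage distillation idea.
# Assumes PyTorch; `teacher(z, t, c)` and `student(z, t, c, w)` are hypothetical
# epsilon-prediction networks, and `ddim_step` is a hypothetical deterministic
# sampler step: ddim_step(model, z, t_from, t_to, c, w) -> latent at t_to.
import torch

def guided_eps(teacher, z_t, t, c, w):
    """Classifier-free guidance: (1 + w) * conditional - w * unconditional."""
    eps_cond = teacher(z_t, t, c)       # class-/text-conditional prediction
    eps_uncond = teacher(z_t, t, None)  # unconditional prediction (condition dropped)
    return (1 + w) * eps_cond - w * eps_uncond

def stage1_loss(teacher, student, z_t, t, c, w):
    """Stage 1: a single w-conditioned student matches the combined guided teacher."""
    with torch.no_grad():
        target = guided_eps(teacher, z_t, t, c, w)
    return torch.mean((student(z_t, t, c, w) - target) ** 2)

def stage2_loss(teacher, student, z_t, t, c, w, ddim_step):
    """Stage 2 (progressive distillation): one student step matches two teacher steps."""
    t_mid, t_next = t - 0.5, t - 1.0    # illustrative discrete-time schedule
    with torch.no_grad():
        z_mid = ddim_step(teacher, z_t, t, t_mid, c, w)
        z_target = ddim_step(teacher, z_mid, t_mid, t_next, c, w)
    z_pred = ddim_step(student, z_t, t, t_next, c, w)
    return torch.mean((z_pred - z_target) ** 2)
```

In the paper, the guidance weight w is sampled from an interval during stage 1 so that one distilled model covers a range of guidance strengths, and stage 2 is applied repeatedly to halve the step count until only a few (1-4) denoising steps remain; the sketch above only illustrates the matching targets, not the paper's specific training details.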