Paper Title

Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models

Paper Authors

Qiucheng Wu, Yujian Liu, Handong Zhao, Ajinkya Kale, Trung Bui, Tong Yu, Zhe Lin, Yang Zhang, Shiyu Chang

Paper Abstract

Generative models have been widely studied in computer vision. Recently, diffusion models have drawn substantial attention due to the high quality of their generated images. A key desired property of image generative models is the ability to disentangle different attributes, which should enable modification towards a style without changing the semantic content, and the modification parameters should generalize to different images. Previous studies have found that generative adversarial networks (GANs) are inherently endowed with such disentanglement capability, so they can perform disentangled image editing without re-training or fine-tuning the network. In this work, we explore whether diffusion models are also inherently equipped with such a capability. Our finding is that for stable diffusion models, by partially changing the input text embedding from a neutral description (e.g., "a photo of person") to one with style (e.g., "a photo of person with smile") while fixing all the Gaussian random noises introduced during the denoising process, the generated images can be modified towards the target style without changing the semantic content. Based on this finding, we further propose a simple, light-weight image editing algorithm where the mixing weights of the two text embeddings are optimized for style matching and content preservation. This entire process only involves optimizing over around 50 parameters and does not fine-tune the diffusion model itself. Experiments show that the proposed method can modify a wide range of attributes, with the performance outperforming diffusion-model-based image-editing algorithms that require fine-tuning. The optimized weights generalize well to different images. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffusionDisentanglement.
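
To make the mechanism described in the abstract concrete, below is a minimal sketch of the soft-mixing idea: one learnable weight per denoising step interpolates between the neutral and the style text embeddings, while the initial Gaussian noise is held fixed so that only the text conditioning changes. This is not the authors' released implementation (see the linked repository for that); it assumes the Hugging Face diffusers library with the runwayml/stable-diffusion-v1-5 checkpoint, and the optimization loop, the style_loss function, and all hyperparameters are illustrative placeholders.

```python
# A minimal sketch (not the authors' released code) of the soft-mixing idea,
# assuming Hugging Face diffusers and the runwayml/stable-diffusion-v1-5 checkpoint.
import torch
from diffusers import StableDiffusionPipeline

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

def encode(prompt: str) -> torch.Tensor:
    """CLIP text embedding of a single prompt."""
    tokens = pipe.tokenizer(
        prompt, padding="max_length",
        max_length=pipe.tokenizer.model_max_length, return_tensors="pt",
    ).input_ids.to(device)
    with torch.no_grad():
        return pipe.text_encoder(tokens)[0]

e_neutral = encode("a photo of person")            # content-only description
e_style = encode("a photo of person with smile")   # same content + target style

pipe.scheduler.set_timesteps(50)
timesteps = pipe.scheduler.timesteps
# One learnable mixing weight per denoising step -- roughly 50 parameters in total.
lambdas = torch.nn.Parameter(torch.zeros(len(timesteps), device=device))
optimizer = torch.optim.Adam([lambdas], lr=0.03)

# Fix the initial Gaussian noise so that only the text embedding changes.
generator = torch.Generator(device=device).manual_seed(0)
latent_init = torch.randn(
    (1, pipe.unet.config.in_channels, 64, 64), generator=generator, device=device
) * pipe.scheduler.init_noise_sigma

def generate(weights: torch.Tensor) -> torch.Tensor:
    latent = latent_init
    for i, t in enumerate(timesteps):
        w = torch.sigmoid(weights[i])              # keep each weight in [0, 1]
        emb = w * e_style + (1.0 - w) * e_neutral  # soft mix of the two embeddings
        inp = pipe.scheduler.scale_model_input(latent, t)
        noise_pred = pipe.unet(inp, t, encoder_hidden_states=emb).sample
        latent = pipe.scheduler.step(noise_pred, t, latent).prev_sample
    return pipe.vae.decode(latent / 0.18215).sample  # SD v1 latent scaling factor

# Illustrative optimization target: a CLIP-based style loss plus an L1 term that
# keeps the edited image close to the one generated from e_neutral alone.
# (style_loss is a hypothetical placeholder; backpropagating through all steps is
# memory-heavy, so a practical setup would need gradient checkpointing or similar.)
# for _ in range(num_iterations):
#     image = generate(lambdas)
#     loss = style_loss(image, "smile") + alpha * (image - image_neutral).abs().mean()
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```

The design point the abstract emphasizes is that only these per-step mixing weights are optimized; the diffusion model itself is untouched, which is why the learned weights can transfer to other images of the same attribute.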
