Paper Title
CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP
Paper Authors
Paper Abstract
Training a text-to-image generator in the general domain (e.g., DALL-E, CogView) requires huge amounts of paired text-image data, which is too expensive to collect. In this paper, we propose a self-supervised scheme named CLIP-GEN for general text-to-image generation using the language-image priors extracted by a pre-trained CLIP model. In our approach, we only require a set of unlabeled images in the general domain to train a text-to-image generator. Specifically, given an image without text labels, we first extract its embedding in the unified language-vision embedding space with the image encoder of CLIP. Next, we convert the image into a sequence of discrete tokens in the VQGAN codebook space (the VQGAN model can be trained with the unlabeled image dataset at hand). Finally, we train an autoregressive transformer that generates the image tokens from the image's unified language-vision representation. Once trained, the transformer can generate coherent image tokens conditioned on the text embedding extracted by the text encoder of CLIP for an input text. Such a strategy enables us to train a strong and general text-to-image generator with a large text-free image dataset such as ImageNet. Qualitative and quantitative evaluations verify that our method significantly outperforms optimization-based text-to-image methods in terms of image quality without compromising text-image matching. Our method can even achieve performance comparable to flagship supervised models such as CogView.
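A minimal PyTorch sketch of the pipeline described in the abstract may make the data flow concrete. It is an illustration under stated assumptions, not the authors' released implementation: it uses OpenAI's `clip` package for the frozen encoders, while `vqgan_tokenize`, `vqgan_decode`, the codebook size, the token-grid length, and the prefix-conditioned transformer are hypothetical stand-ins for the paper's pre-trained VQGAN and transformer.

```python
# Sketch of the CLIP-GEN flow: train on CLIP *image* embeddings, generate from CLIP *text* embeddings.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)  # frozen, pre-trained CLIP

CODEBOOK_SIZE = 1024  # hypothetical VQGAN codebook size
SEQ_LEN = 256         # hypothetical 16x16 grid of VQGAN tokens per image
DIM = 512             # CLIP ViT-B/32 embedding width


class PrefixTransformer(nn.Module):
    """Decoder-only transformer that predicts VQGAN tokens autoregressively,
    conditioned on a CLIP embedding prepended as a prefix (simplified)."""

    def __init__(self, vocab=CODEBOOK_SIZE, dim=DIM, layers=4, heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, SEQ_LEN + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, cond_emb, tokens):
        # cond_emb: (B, DIM) CLIP embedding; tokens: (B, T) image codes seen so far.
        x = torch.cat([cond_emb.unsqueeze(1), self.tok_emb(tokens)], dim=1)
        x = x + self.pos_emb[:, : x.size(1)]
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # next-token logits at every position


def training_step(model, optimizer, images, vqgan_tokenize):
    """Language-free training: condition on the image's own CLIP image embedding."""
    with torch.no_grad():
        cond = clip_model.encode_image(images).float()  # (B, DIM)
        codes = vqgan_tokenize(images)                  # (B, SEQ_LEN), hypothetical VQGAN helper
    logits = model(cond, codes[:, :-1])                 # teacher forcing with prefix shift
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, CODEBOOK_SIZE), codes.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def generate(model, prompt, vqgan_decode, temperature=1.0):
    """Inference: swap in the CLIP text embedding, which lives in the same space."""
    cond = clip_model.encode_text(clip.tokenize([prompt]).to(device)).float()
    tokens = torch.empty(1, 0, dtype=torch.long, device=device)
    for _ in range(SEQ_LEN):
        next_logits = model(cond, tokens)[:, -1] / temperature
        nxt = torch.multinomial(next_logits.softmax(dim=-1), num_samples=1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return vqgan_decode(tokens)  # hypothetical helper: codes -> RGB image
```

The key design point the abstract relies on is that CLIP's image and text encoders map into a shared embedding space, so a transformer trained only on image embeddings as the conditioning prefix can be driven by text embeddings at inference time.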