Paper Title
DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image Generation
Paper Authors
Paper Abstract
Most existing text-to-image generation methods adopt a multi-stage modular architecture, which has three significant problems: 1) training multiple networks increases the run time and affects the convergence and stability of the generative model; 2) these approaches ignore the quality of the images produced by early-stage generators; 3) many discriminators need to be trained. To this end, we propose the Dual Attention Generative Adversarial Network (DTGAN), which can synthesize high-quality and semantically consistent images employing only a single generator/discriminator pair. The proposed model introduces channel-aware and pixel-aware attention modules that guide the generator to focus on text-relevant channels and pixels based on the global sentence vector and to fine-tune the original feature maps using attention weights. Also, Conditional Adaptive Instance-Layer Normalization (CAdaILN) is presented to help our attention modules flexibly control the amount of change in shape and texture according to the input natural-language description. Furthermore, a new type of visual loss is utilized to enhance the image resolution by ensuring vivid shapes and perceptually uniform color distributions in the generated images. Experimental results on benchmark datasets demonstrate the superiority of our proposed method over state-of-the-art models with a multi-stage framework. Visualization of the attention maps shows that the channel-aware attention module is able to localize the discriminative regions, while the pixel-aware attention module has the ability to capture the global visual content for image generation.
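The sketch below is not the authors' released code; it is a minimal illustration of two ideas named in the abstract: a channel-aware attention module that re-weights feature-map channels from the global sentence vector, and a CAdaILN-style normalization whose mix of instance and layer statistics is modulated by that sentence vector. All tensor shapes, layer sizes, and module names are assumptions made for illustration only.

```python
# Illustrative sketch only (assumed shapes and names, not the official DTGAN code).
import torch
import torch.nn as nn


class ChannelAwareAttention(nn.Module):
    """Re-weight channels of a feature map conditioned on the global sentence vector."""

    def __init__(self, num_channels: int, sent_dim: int):
        super().__init__()
        # Project the sentence vector to one attention weight per channel.
        self.fc = nn.Linear(sent_dim, num_channels)

    def forward(self, feat: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W), sent: (B, sent_dim)
        attn = torch.sigmoid(self.fc(sent)).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        # Fine-tune the original feature map with the attention weights.
        return feat * attn


class CAdaILN(nn.Module):
    """Conditional adaptive instance-layer normalization (illustrative)."""

    def __init__(self, num_channels: int, sent_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Learnable per-channel mixing ratio between instance and layer statistics.
        self.rho = nn.Parameter(torch.full((1, num_channels, 1, 1), 0.9))
        # Scale and shift predicted from the global sentence vector.
        self.gamma_fc = nn.Linear(sent_dim, num_channels)
        self.beta_fc = nn.Linear(sent_dim, num_channels)

    def forward(self, feat: torch.Tensor, sent: torch.Tensor) -> torch.Tensor:
        # Instance-norm statistics: per sample, per channel.
        in_mean = feat.mean(dim=(2, 3), keepdim=True)
        in_var = feat.var(dim=(2, 3), keepdim=True, unbiased=False)
        # Layer-norm statistics: per sample over all channels and pixels.
        ln_mean = feat.mean(dim=(1, 2, 3), keepdim=True)
        ln_var = feat.var(dim=(1, 2, 3), keepdim=True, unbiased=False)
        out_in = (feat - in_mean) / torch.sqrt(in_var + self.eps)
        out_ln = (feat - ln_mean) / torch.sqrt(ln_var + self.eps)
        rho = self.rho.clamp(0.0, 1.0)
        out = rho * out_in + (1.0 - rho) * out_ln
        gamma = self.gamma_fc(sent).unsqueeze(-1).unsqueeze(-1)
        beta = self.beta_fc(sent).unsqueeze(-1).unsqueeze(-1)
        return out * gamma + beta


if __name__ == "__main__":
    feat = torch.randn(2, 64, 16, 16)   # dummy generator feature map
    sent = torch.randn(2, 256)          # dummy global sentence embedding
    feat = ChannelAwareAttention(64, 256)(feat, sent)
    feat = CAdaILN(64, 256)(feat, sent)
    print(feat.shape)  # torch.Size([2, 64, 16, 16])
```

Conditioning both the channel re-weighting and the normalization parameters on the same sentence vector is what lets a single-stage generator adjust shape and texture from the text without stacking multiple generator/discriminator pairs, which is the design point the abstract emphasizes.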