Paper Title

Megapixel Image Generation with Step-Unrolled Denoising Autoencoders

Paper Authors

McKinney, Alex F., Willcocks, Chris G.

Paper Abstract

An ongoing trend in generative modelling research has been to push sample resolutions higher whilst simultaneously reducing computational requirements for training and sampling. We aim to push this trend further via the combination of techniques - each component representing the current pinnacle of efficiency in their respective areas. These include vector-quantized GAN (VQ-GAN), a vector-quantization (VQ) model capable of high levels of lossy - but perceptually insignificant - compression; hourglass transformers, a highly scalable self-attention model; and step-unrolled denoising autoencoders (SUNDAE), a non-autoregressive (NAR) text generative model. Unexpectedly, our method highlights weaknesses in the original formulation of hourglass transformers when applied to multidimensional data. In light of this, we propose modifications to the resampling mechanism, applicable in any task applying hierarchical transformers to multidimensional data. Additionally, we demonstrate the scalability of SUNDAE to long sequence lengths - four times longer than prior work. Our proposed framework scales to high resolutions ($1024 \times 1024$) and trains quickly (2-4 days). Crucially, the trained model produces diverse and realistic megapixel samples in approximately 2 seconds on a consumer-grade GPU (GTX 1080Ti). In general, the framework is flexible: supporting an arbitrary number of sampling steps, sample-wise self-stopping, self-correction capabilities, conditional generation, and a NAR formulation that allows for arbitrary inpainting masks. We obtain FID scores of 10.56 on FFHQ256 - close to the original VQ-GAN in less than half the sampling steps - and 21.85 on FFHQ1024 in only 100 sampling steps.
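To make the abstract's pipeline more concrete, below is a minimal sketch of step-unrolled denoising over a flattened grid of VQ-GAN codebook indices: a denoising network is trained on corrupted token sequences and then "unrolled" on its own samples, and generation proceeds non-autoregressively by re-sampling all positions for a fixed number of steps. This is an illustrative sketch only (written in PyTorch by assumption); the `denoiser` network, the corruption scheme, and all hyperparameters here are placeholders, not the authors' implementation.

```python
# Illustrative sketch of SUNDAE-style training and sampling over VQ token grids.
# `denoiser` is assumed to map token indices (B, L) to logits (B, L, vocab_size),
# e.g. an hourglass transformer; it is a placeholder, not the paper's model.
import torch
import torch.nn.functional as F

def unrolled_denoising_loss(denoiser, tokens, vocab_size, unroll_steps=2):
    """Corrupt the clean tokens, then apply the denoiser repeatedly, feeding its
    own samples back in ("step-unrolling"); accumulate cross-entropy against the
    clean tokens at every unroll step."""
    # Per-sample corruption rate: replace a random fraction of tokens with noise.
    rate = torch.rand(tokens.size(0), 1, device=tokens.device)
    mask = torch.rand_like(tokens, dtype=torch.float) < rate
    current = torch.where(mask, torch.randint_like(tokens, vocab_size), tokens)

    loss = 0.0
    for _ in range(unroll_steps):
        logits = denoiser(current)  # (B, L, vocab_size)
        loss = loss + F.cross_entropy(logits.transpose(1, 2), tokens)
        # Sampling is non-differentiable; the next step sees the model's output.
        current = torch.distributions.Categorical(logits=logits).sample()
    return loss / unroll_steps

@torch.no_grad()
def sample(denoiser, seq_len, vocab_size, steps=100, batch=4, device="cpu"):
    """Non-autoregressive sampling: start from uniform noise and iteratively
    re-sample every position from the denoiser's predicted distribution."""
    tokens = torch.randint(0, vocab_size, (batch, seq_len), device=device)
    for _ in range(steps):
        logits = denoiser(tokens)
        tokens = torch.distributions.Categorical(logits=logits).sample()
    return tokens  # decode with the VQ-GAN decoder to recover images
```

Because every position is re-sampled at each step, the same loop supports self-correction and arbitrary inpainting masks (simply freeze the known positions between steps), which is the flexibility the abstract refers to.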
