Paper Title
UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training
Paper Authors
Paper Abstract
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks using a novel training procedure, referred to as a pseudo-masked language model (PMLM). Given an input text with masked tokens, we rely on conventional masks to learn inter-relations between corrupted tokens and context via autoencoding, and pseudo masks to learn intra-relations between masked spans via partially autoregressive modeling. With well-designed position embeddings and self-attention masks, the context encodings are reused to avoid redundant computation. Moreover, conventional masks used for autoencoding provide global masking information, so that all the position embeddings are accessible in partially autoregressive language modeling. In addition, the two tasks pre-train a unified language model as a bidirectional encoder and a sequence-to-sequence decoder, respectively. Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks across several widely used benchmarks.
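To make the pseudo-masking layout described in the abstract concrete, the sketch below (a rough Python illustration, not the authors' code) builds a PMLM-style input: conventional [MASK] tokens stay in place in the corrupted sequence for the autoencoding objective, while for each masked position a pseudo [P] token and the original token are appended with the original position id reused, which is what allows the partially autoregressive objective to share the same context encoding. Only the [MASK]/[P] token names and the position-id sharing come from the paper; the function name, span format, and return layout are assumptions made here for illustration.

# Minimal sketch (not the authors' implementation) of a PMLM-style input layout.
# Conventional [MASK] tokens remain in place for the autoencoding (AE) task;
# pseudo [P] tokens plus the original tokens of each masked span are appended,
# reusing the original position ids, for the partially autoregressive (PAR) task.

def build_pmlm_input(tokens, masked_spans):
    """tokens: list of str; masked_spans: list of (start, end) index pairs."""
    masked_positions = {i for s, e in masked_spans for i in range(s, e)}

    # Corrupted sequence for AE: replace masked tokens with [MASK] in place.
    input_tokens = [t if i not in masked_positions else "[MASK]"
                    for i, t in enumerate(tokens)]
    position_ids = list(range(len(tokens)))

    # Appended part for PAR: for every masked position, add a pseudo [P] token
    # (used to predict the original token) and the original token itself
    # (so later spans in the factorization order can condition on it),
    # both sharing the original position id.
    for start, end in masked_spans:
        for i in range(start, end):
            input_tokens += ["[P]", tokens[i]]
            position_ids += [i, i]

    return input_tokens, position_ids


if __name__ == "__main__":
    toks = ["x1", "x2", "x3", "x4", "x5", "x6"]
    inp, pos = build_pmlm_input(toks, masked_spans=[(1, 2), (3, 5)])
    print(list(zip(inp, pos)))
    # [('x1', 0), ('[MASK]', 1), ('x3', 2), ('[MASK]', 3), ('[MASK]', 4),
    #  ('x6', 5), ('[P]', 1), ('x2', 1), ('[P]', 3), ('x4', 3), ('[P]', 4), ('x5', 4)]

In the full model, self-attention masks (not shown here) would restrict what the [MASK], [P], and appended original tokens can attend to, so that the AE and PAR objectives are computed in a single forward pass without redundant re-encoding of the context.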