Paper Title
CAPT: Contrastive Pre-Training for Learning Denoised Sequence Representations
Paper Authors
Paper Abstract
Pre-trained self-supervised models such as BERT have achieved striking success in learning sequence representations, especially for natural language processing. These models typically corrupt the given sequences with certain types of noise, such as masking, shuffling, or substitution, and then try to recover the original input. However, such pre-training approaches are prone to learning representations that are covariant with the noise, leading to a discrepancy between the pre-training and fine-tuning stages. To remedy this, we present ContrAstive Pre-Training (CAPT) to learn noise-invariant sequence representations. The proposed CAPT encourages consistency between the representations of the original sequence and its corrupted version via unsupervised instance-wise training signals. In this way, it not only alleviates the pretrain-finetune discrepancy induced by the noise of pre-training, but also aids the pre-trained model in better capturing the global semantics of the input via more effective sentence-level supervision. Different from most prior work that focuses on a particular modality, comprehensive empirical evidence on 11 natural language understanding and cross-modal tasks illustrates that CAPT is applicable for both language and vision-language tasks, and obtains surprisingly consistent improvement, including 0.6\% absolute gain on GLUE benchmarks and 0.8\% absolute increment on $\text{NLVR}^2$.
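The abstract describes encouraging agreement between a sequence's representation and that of its corrupted version through unsupervised instance-wise training signals. As a rough illustration of that idea (not the paper's exact formulation), below is a minimal sketch of an InfoNCE-style in-batch contrastive loss in PyTorch; the function name `capt_contrastive_loss`, the pooled-representation inputs, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def capt_contrastive_loss(orig_repr: torch.Tensor,
                          corrupted_repr: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """Illustrative InfoNCE-style contrastive loss between sequence-level
    representations of original inputs and their corrupted counterparts.

    orig_repr:      (batch, dim) pooled representations of original sequences
    corrupted_repr: (batch, dim) pooled representations of corrupted sequences
    """
    # L2-normalize so dot products become cosine similarities
    orig = F.normalize(orig_repr, dim=-1)
    corr = F.normalize(corrupted_repr, dim=-1)

    # Similarity of every original sequence against every corrupted one
    logits = orig @ corr.t() / temperature  # (batch, batch)

    # The matching corrupted version (diagonal) is the positive;
    # other in-batch corrupted sequences serve as negatives
    targets = torch.arange(orig.size(0), device=orig.device)
    return F.cross_entropy(logits, targets)
```

In this sketch, pulling each original sequence toward its own corrupted version while pushing it away from other in-batch examples is what discourages representations from covarying with the corruption noise, matching the noise-invariance goal stated in the abstract.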