Paper Title

Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation

Paper Authors

Aditya Siddhant, Ankur Bapna, Yuan Cao, Orhan Firat, Mia Chen, Sneha Kudugunta, Naveen Arivazhagan, Yonghui Wu

Paper Abstract

Over the last few years two promising research directions in low-resource neural machine translation (NMT) have emerged. The first focuses on utilizing high-resource languages to improve the quality of low-resource languages via multilingual NMT. The second direction employs monolingual data with self-supervision to pre-train translation models, followed by fine-tuning on small amounts of supervised data. In this work, we join these two lines of research and demonstrate the efficacy of monolingual data with self-supervision in multilingual NMT. We offer three major results: (i) Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models. (ii) Self-supervision improves zero-shot translation quality in multilingual models. (iii) Leveraging monolingual data with self-supervision provides a viable path towards adding new languages to multilingual models, getting up to 33 BLEU on ro-en translation without any parallel data or back-translation.
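The self-supervised objective the paper pairs with multilingual translation is MASS (Song et al., 2019): mask a contiguous span of a monolingual sentence on the encoder side and train the decoder to reconstruct that span. As a rough sketch of how monolingual and parallel examples can then feed one shared encoder-decoder, consider the data-mixing code below; the names here (LANG_TAGS, make_mass_example, mono_weight) are hypothetical illustrations, not the paper's actual implementation.

```python
import random

# Target-language tags, as commonly used in multilingual NMT so that one
# model can translate into several languages.
LANG_TAGS = {"en": "<2en>", "ro": "<2ro>"}


def make_supervised_example(src_tokens, tgt_tokens, tgt_lang):
    """Parallel data: translate the source into the tagged target language."""
    return {
        "encoder_input": [LANG_TAGS[tgt_lang]] + src_tokens,
        "decoder_target": tgt_tokens,
    }


def make_mass_example(tokens, lang, mask_token="<mask>", mask_ratio=0.5):
    """Monolingual data, MASS-style: mask a contiguous span on the encoder
    side and train the decoder to reconstruct only that span."""
    n = len(tokens)
    span_len = max(1, int(n * mask_ratio))
    start = random.randrange(n - span_len + 1)
    masked = tokens[:start] + [mask_token] * span_len + tokens[start + span_len:]
    return {
        "encoder_input": [LANG_TAGS[lang]] + masked,
        "decoder_target": tokens[start:start + span_len],
    }


def sample_training_example(parallel, monolingual, mono_weight=0.5):
    """Draw one example, choosing the self-supervised task with probability
    mono_weight so both objectives update the same shared parameters."""
    if monolingual and random.random() < mono_weight:
        tokens, lang = random.choice(monolingual)
        return make_mass_example(tokens, lang)
    src, tgt, tgt_lang = random.choice(parallel)
    return make_supervised_example(src, tgt, tgt_lang)


if __name__ == "__main__":
    random.seed(0)
    parallel = [(["Acesta", "este", "un", "test"], ["This", "is", "a", "test"], "en")]
    monolingual = [(["Acesta", "este", "un", "exemplu", "monolingv"], "ro")]
    for _ in range(3):
        print(sample_training_example(parallel, monolingual))
```

Mixing the two objectives in one stream is what makes result (iii) plausible: a language with only monolingual data (such as Romanian in the ro-en experiment) still contributes gradients through the shared encoder and decoder, so the model can later translate it without any parallel data or back-translation.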
