Paper Title
Refining Low-Resource Unsupervised Translation by Language Disentanglement of Multilingual Models
Paper Authors
Paper Abstract
Numerous recent works on unsupervised machine translation (UMT) imply that competent unsupervised translation of low-resource and unrelated languages, such as Nepali or Sinhala, is only possible if the model is trained in a massive multilingual environment, where these low-resource languages are mixed with high-resource counterparts. Nonetheless, while the high-resource languages greatly help kick-start the target low-resource translation tasks, the language discrepancy between them may hinder further improvement. In this work, we propose a simple refinement procedure to disentangle languages from a pre-trained multilingual UMT model so that it focuses only on the target low-resource task. Our method achieves the state of the art in the fully unsupervised translation tasks of English to Nepali, Sinhala, Gujarati, Latvian, Estonian, and Kazakh, with BLEU score gains of 3.5, 3.5, 3.3, 4.1, 4.2, and 3.3, respectively. Our codebase is available at https://github.com/nxphi47/refine_unsup_multilingual_mt
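The abstract only describes the refinement procedure at a high level. As a rough illustration of the general idea — continuing unsupervised (back-translation) training of a pre-trained multilingual model on the target pair alone — here is a minimal sketch in Python. The mBART-50 checkpoint, the English–Nepali placeholder data, and the training loop are illustrative assumptions, not the authors' actual setup.

```python
# Hypothetical sketch: refine a pre-trained multilingual UMT model on a single
# low-resource pair (English <-> Nepali) via iterative back-translation.
# The checkpoint and loop below are stand-ins, not the paper's implementation.
import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Stand-in for the authors' pre-trained multilingual UMT checkpoint.
model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def back_translation_step(sentences, src_lang, tgt_lang):
    """One round-trip update: translate src -> tgt without gradients, then
    train the tgt -> src direction to reconstruct the original sentences."""
    # 1) Synthesize translations of the monolingual batch (no gradient).
    tokenizer.src_lang = src_lang
    enc = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        synth_ids = model.generate(
            **enc, forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang)
        )
    synthetic = tokenizer.batch_decode(synth_ids, skip_special_tokens=True)

    # 2) Supervised update on (synthetic target, original source) pairs.
    tokenizer.src_lang, tokenizer.tgt_lang = tgt_lang, src_lang
    batch = tokenizer(synthetic, text_target=sentences,
                      return_tensors="pt", padding=True, truncation=True)
    batch["labels"][batch["labels"] == tokenizer.pad_token_id] = -100  # ignore pad
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Refinement restricted to the target pair only, alternating directions
# over monolingual data from each side.
english_mono = ["The weather is nice today."]  # placeholder monolingual data
nepali_mono = ["आज मौसम राम्रो छ।"]
model.train()
for step in range(3):
    back_translation_step(english_mono, "en_XX", "ne_NP")
    back_translation_step(nepali_mono, "ne_NP", "en_XX")
```

The key design point this sketch tries to convey is the restriction of continued training to the single target pair, in contrast to the massively multilingual pre-training stage; the specific back-translation recipe shown here is only one plausible instantiation.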