Title
Cost-Effective Training in Low-Resource Neural Machine Translation
Authors
Abstract
While Active Learning (AL) techniques have been explored in Neural Machine Translation (NMT), only a few works focus on tackling low annotation budgets, where only a limited number of sentences can be translated. Such situations are especially challenging and can occur for endangered languages with few human annotators, or where cost constraints prevent labelling large amounts of data. Although AL has been shown to be helpful with large budgets, it alone is not enough to build high-quality translation systems in these low-resource conditions. In this work, we propose a cost-effective training procedure to increase the performance of NMT models, utilizing a small number of annotated sentences and dictionary entries. Our method leverages monolingual data with self-supervised objectives and a small-scale, inexpensive dictionary for additional supervision to initialize the NMT model before applying AL. We show that improving the model using a combination of these knowledge sources is essential to exploit AL strategies and increase gains in low-resource conditions. We also present a novel AL strategy inspired by domain adaptation for NMT and show that it is effective for low budgets. We propose a new hybrid data-driven approach, which samples sentences that are diverse from the labelled data and also most similar to the unlabelled data. Finally, we show that initializing the NMT model and further using our AL strategy can achieve gains of up to $13$ BLEU compared to conventional AL methods.
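The hybrid data-driven selection described in the abstract (sample sentences diverse from the labelled data yet similar to the unlabelled pool) can be sketched with a simple n-gram overlap score. This is a hypothetical, simplified stand-in for the paper's actual similarity measure; the function names and the bigram-overlap scoring are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def ngram_counts(sentences, n=2):
    """Aggregate n-gram counts over a list of whitespace-tokenized sentences."""
    counts = Counter()
    for s in sentences:
        toks = s.split()
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return counts

def overlap_score(sentence, counts, n=2):
    """Fraction of the sentence's n-grams that also appear in `counts`."""
    toks = sentence.split()
    ngrams = [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    if not ngrams:
        return 0.0
    return sum(1 for g in ngrams if g in counts) / len(ngrams)

def select_batch(candidates, labelled, unlabelled, budget, n=2):
    """Hybrid ranking: prefer candidates similar to the unlabelled pool
    (high density) and dissimilar to the labelled data (high diversity),
    then take the top `budget` sentences for annotation."""
    lab = ngram_counts(labelled, n)
    unl = ngram_counts(unlabelled, n)
    scored = sorted(
        candidates,
        key=lambda s: overlap_score(s, unl, n) - overlap_score(s, lab, n),
        reverse=True,
    )
    return scored[:budget]
```

Under this toy scoring, a candidate whose n-grams are well covered by the unlabelled pool but absent from the labelled set ranks first, matching the abstract's intuition of steering the annotation budget toward unexplored regions of the data.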