Paper Title
Not Low-Resource Anymore: Aligner Ensembling, Batch Filtering, and New Datasets for Bengali-English Machine Translation
Authors
Abstract
Despite being the seventh most widely spoken language in the world, Bengali has received much less attention in the machine translation literature because it is low in resources. Most publicly available parallel corpora for Bengali are not large enough and are of rather poor quality, mostly because of incorrect sentence alignments resulting from erroneous sentence segmentation, and also because of the high volume of noise present in them. In this work, we build a customized sentence segmenter for Bengali and propose two novel methods for parallel corpus creation in low-resource setups: aligner ensembling and batch filtering. With the segmenter and the two methods combined, we compile a high-quality Bengali-English parallel corpus comprising 2.75 million sentence pairs, more than 2 million of which were not available before. Training neural models on this corpus, we achieve an improvement of more than 9 BLEU points over previous approaches to Bengali-English machine translation. We also evaluate on a new test set of 1000 pairs made with extensive quality control. We release the segmenter, the parallel corpus, and the evaluation set, thus elevating Bengali from its low-resource status. To the best of our knowledge, this is the first large-scale study on Bengali-English machine translation. We believe our study will pave the way for future research on Bengali-English machine translation as well as on other low-resource languages. Our data and code are available at https://github.com/csebuetnlp/banglanmt.