Paper Title

Finetuning a Kalaallisut-English machine translation system using web-crawled data

Author

Jones, Alex

Abstract

West Greenlandic, known by native speakers as Kalaallisut, is an extremely low-resource polysynthetic language spoken by around 56,000 people in Greenland. Here, we attempt to finetune a pretrained Kalaallisut-to-English neural machine translation (NMT) system using web-crawled pseudoparallel sentences from around 30 multilingual websites. We compile a corpus of over 93,000 Kalaallisut sentences and over 140,000 Danish sentences, then use cross-lingual sentence embeddings and approximate nearest-neighbor search in an attempt to mine near-translations from these corpora. Finally, we translate the Danish sentences into English to obtain a synthetic Kalaallisut-English aligned corpus. Although the resulting dataset is too small and noisy to improve the pretrained MT model, we believe that with additional resources, we could construct a better pseudoparallel corpus and achieve more promising results on MT. We also note other possible uses of the monolingual Kalaallisut data and discuss directions for future work. We make the code and data for our experiments publicly available.
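
The mining step described in the abstract (embed sentences from both languages in a shared space, then pair each sentence with its nearest neighbor) can be illustrated with a minimal sketch. This is not the paper's implementation: the toy vectors stand in for real cross-lingual sentence embeddings, an exact brute-force cosine search is swapped in for approximate nearest-neighbor search, and the function name `mine_pairs`, the mutual-nearest-neighbor filter, and the `threshold` value are all assumptions made for illustration.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mine_pairs(src_emb, tgt_emb, threshold=0.8):
    """Return (src_idx, tgt_idx, score) for mutual nearest neighbors.

    Hypothetical helper: a pair is kept only if each sentence is the
    other's best match and their cosine similarity reaches `threshold`.
    """
    # Exact similarity matrix; a real pipeline over ~100k sentences would
    # typically use an approximate nearest-neighbor index instead.
    sim = normalize(src_emb) @ normalize(tgt_emb).T
    best_tgt = sim.argmax(axis=1)  # best target for each source sentence
    best_src = sim.argmax(axis=0)  # best source for each target sentence
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sim[i, j] >= threshold:
            pairs.append((i, int(j), float(sim[i, j])))
    return pairs


# Toy 3-dimensional "embeddings" (assumed values, not real model output).
kalaallisut = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.5, 0.5, 0.7]])
danish = np.array([[0.0, 0.9, 0.1],
                   [0.9, 0.1, 0.0],
                   [0.0, 0.0, 1.0],
                   [-1.0, 0.0, 0.0]])

pairs = mine_pairs(kalaallisut, danish)
for src_idx, tgt_idx, score in pairs:
    print(f"Kalaallisut #{src_idx} <-> Danish #{tgt_idx} (cosine {score:.3f})")
```

In this toy example the third Kalaallisut vector has a mutual nearest neighbor, but its similarity falls below the threshold, so it is discarded. A filter of this kind is one plausible way to trade corpus size against noise in a mined corpus.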
