Data Augmentation to Address Out-of-Vocabulary Problem in Low-Resource Sinhala-English Neural Machine Translation

Paper Title

Data Augmentation to Address Out-of-Vocabulary Problem in Low-Resource Sinhala-English Neural Machine Translation

Authors

Fernando, Aloka; Ranathunga, Surangika

Abstract

Out-of-Vocabulary (OOV) words are a problem for Neural Machine Translation (NMT). OOV refers to words that occur rarely in the training data, or that are absent from it altogether. To alleviate this, word- or phrase-based Data Augmentation (DA) techniques have been used. However, existing DA techniques address only one of these OOV types and are limited to considering either syntactic constraints or semantic constraints. We present a word and phrase replacement-based DA technique that considers both types of OOV by augmenting (1) rare words in the existing parallel corpus, and (2) new words from a bilingual dictionary. During augmentation, we consider both the syntactic and semantic properties of the words to guarantee fluency in the synthetic sentences. This technique was evaluated on the low-resource Sinhala-English language pair. We observe that with only semantic constraints in the DA, the results are comparable with the scores obtained when considering syntactic constraints, which is favourable for low-resource languages that lack linguistic tool support. Additionally, results can be further improved by considering both syntactic and semantic constraints.
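
The general idea of replacement-based augmentation under semantic and syntactic constraints can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the cosine-similarity threshold, the toy word vectors, the part-of-speech check, and the placeholder tokens such as `src_dog` are all assumptions made for illustration, and target-side positions are found by a simple dictionary lookup rather than by word alignment.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rare_words(corpus, max_count):
    """Source-side words occurring at most max_count times (hypothetical rarity rule)."""
    counts = Counter(tok for src, _ in corpus for tok in src)
    return {w for w, c in counts.items() if c <= max_count}


def augment(corpus, dictionary, vectors, candidates, threshold=0.8, pos_tags=None):
    """Create synthetic sentence pairs by swapping in candidate words.

    A candidate replaces a source word only if the two are semantically
    close (cosine >= threshold) and, when pos_tags is given, share a tag.
    The target side is updated using the bilingual dictionary.
    """
    synthetic = []
    for src, tgt in corpus:
        for i, word in enumerate(src):
            # The replaced word must have a known translation in the target sentence.
            if word not in dictionary or dictionary[word] not in tgt:
                continue
            j = tgt.index(dictionary[word])
            for cand in sorted(candidates):
                if cand == word or cand not in dictionary:
                    continue
                if word not in vectors or cand not in vectors:
                    continue
                # Semantic constraint (assumed form: embedding similarity).
                if cosine(vectors[word], vectors[cand]) < threshold:
                    continue
                # Optional syntactic constraint (assumed form: same POS tag).
                if pos_tags is not None and pos_tags.get(word) != pos_tags.get(cand):
                    continue
                new_src = src[:i] + [cand] + src[i + 1:]
                new_tgt = tgt[:j] + [dictionary[cand]] + tgt[j + 1:]
                synthetic.append((new_src, new_tgt))
    return synthetic


# Toy data with placeholder source tokens (not real Sinhala).
demo_corpus = [(["src_i", "src_saw", "src_dog"], ["i", "saw", "dog"])]
demo_dictionary = {"src_dog": "dog", "src_cat": "cat", "src_table": "table"}
demo_vectors = {"src_dog": [1.0, 0.1], "src_cat": [0.9, 0.2], "src_table": [0.0, 1.0]}
demo_candidates = {"src_cat", "src_table"}  # e.g. rare words plus new dictionary words

synthetic = augment(demo_corpus, demo_dictionary, demo_vectors, demo_candidates)
```

In this toy setup the candidate set would combine the output of `rare_words` with dictionary entries absent from the corpus, mirroring the two OOV types described in the abstract. Passing `pos_tags` switches from semantic-only filtering to combined syntactic and semantic filtering.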
