Paper Title
Don't Forget Cheap Training Signals Before Building Unsupervised Bilingual Word Embeddings
Paper Authors
Paper Abstract
Bilingual Word Embeddings (BWEs) are one of the cornerstones of cross-lingual transfer of NLP models. They can be built using only monolingual corpora, without supervision, which has led to numerous works focusing on unsupervised BWEs. However, most current approaches to building unsupervised BWEs do not compare their results with methods based on easy-to-access cross-lingual signals. In this paper, we argue that such signals should always be considered when developing unsupervised BWE methods. The two approaches we find most effective are: 1) using identical words as seed lexicons (which unsupervised approaches incorrectly assume are unavailable for orthographically distinct language pairs) and 2) combining such lexicons with pairs extracted by matching romanized versions of words under an edit distance threshold. We experiment on thirteen non-Latin languages (and English) and show that these cheap signals work well: they outperform more complex unsupervised methods on distant language pairs such as Chinese, Japanese, Kannada, Tamil, and Thai, and they are even competitive with supervised approaches that use high-quality lexicons. Our results show that these training signals should not be neglected when building BWEs, even for distant languages.
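The two cheap signals in the abstract (identical words and romanized near-matches) can be illustrated with a short sketch. The code below is not the paper's pipeline: the `cheap_seed_lexicon` helper and `max_dist` parameter are hypothetical names, `unidecode` stands in for whatever romanizer the authors actually use, and the brute-force pairwise comparison is for illustration only.

```python
from unidecode import unidecode  # assumed romanizer; the paper's exact tool may differ


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def cheap_seed_lexicon(src_vocab, tgt_vocab, max_dist=1):
    """Extract a seed lexicon from two monolingual vocabularies using
    1) identical surface forms and 2) romanized forms within an edit
    distance threshold (hypothetical helper, illustrative only)."""
    # Signal 1: words that appear identically in both vocabularies.
    pairs = {(w, w) for w in set(src_vocab) & set(tgt_vocab)}

    # Signal 2: romanized near-matches.  O(|V_src| * |V_tgt|) here;
    # in practice the vocabularies would be restricted to frequent words.
    rom_tgt = {t: unidecode(t).lower() for t in tgt_vocab}
    for s in src_vocab:
        rs = unidecode(s).lower()
        for t, rt in rom_tgt.items():
            if edit_distance(rs, rt) <= max_dist:
                pairs.add((s, t))
    return sorted(pairs)


if __name__ == "__main__":
    # Toy vocabularies (illustrative, not from the paper).
    src = ["россия", "такси", "internet"]
    tgt = ["russia", "taxi", "internet"]
    print(cheap_seed_lexicon(src, tgt, max_dist=2))
```

The resulting word pairs would then serve as the seed dictionary for a standard supervised mapping method, in place of a manually curated lexicon.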