Paper title
Phonetic and Visual Priors for Decipherment of Informal Romanization
Paper authors
Paper abstract
Informal romanization is an idiosyncratic process used by humans in informal digital communication to encode non-Latin script languages into Latin character sets found on common keyboards. Character substitution choices differ between users but have been shown to be governed by the same main principles observed across a variety of languages; namely, character pairs are often associated through phonetic or visual similarity. We propose a noisy-channel WFST cascade model for deciphering the original non-Latin script from observed romanized text in an unsupervised fashion. We train our model directly on romanized data from two languages: Egyptian Arabic and Russian. We demonstrate that adding inductive bias through phonetic and visual priors on character mappings substantially improves the model's performance on both languages, yielding results much closer to the supervised skyline. Finally, we introduce a new dataset of romanized Russian, collected from a Russian social network website and partially annotated for our experiments.