Paper Title
MANorm: A Normalization Dictionary for Moroccan Arabic Dialect Written in Latin Script
Paper Authors
Paper Abstract
User-generated text from social media is a major resource for many NLP tasks. This text, however, does not follow standard writing rules. Moreover, the use of dialects such as Moroccan Arabic in written communication further increases the complexity of NLP tasks. A dialect is a spoken language with no standard orthography, which leads users to improvise spellings when writing. As a result, the same word can appear in multiple transliterated forms, and these transliterations must be normalized to a single canonical word form. To reach this goal, we exploit the power of word embedding models trained on a corpus of YouTube comments. In addition, using a Moroccan Arabic dialect dictionary that provides the canonical forms, we build a normalization dictionary that we refer to as MANorm. We conducted several experiments to demonstrate the effectiveness of MANorm, and they show its usefulness for dialect normalization.
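The abstract does not spell out how the embedding model and the canonical-form dictionary are combined, so the following is only a minimal sketch of one plausible reading: for each canonical form, collect words that are close both in embedding space and in spelling, and map them to that form. The toy vectors, the example spellings, the choice of which spelling is canonical, the two thresholds, and the function names `build_normalization_dict` and `normalize` are all assumptions made for illustration and are not taken from the paper.

```python
import math
from difflib import SequenceMatcher

# Toy embeddings with made-up values (assumption). In practice the vectors
# would come from a word embedding model trained on a comment corpus.
EMBEDDINGS = {
    "bzaf":   [0.90, 0.10, 0.00],
    "bzzaf":  [0.88, 0.12, 0.02],
    "bezaf":  [0.91, 0.09, 0.01],
    "mzyan":  [0.10, 0.90, 0.05],
    "mezyan": [0.12, 0.88, 0.04],
    "video":  [0.00, 0.10, 0.95],
}

# Canonical forms, standing in for entries of a dialect dictionary (assumption).
CANONICAL_FORMS = ["bzaf", "mzyan"]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_normalization_dict(embeddings, canonical_forms,
                             min_cosine=0.95, min_string_sim=0.6):
    """Map each spelling variant to a canonical form.

    A word counts as a variant of a canonical form when it is close to it
    both in embedding space and in spelling (hypothetical criteria).
    """
    mapping = {}
    for canon in canonical_forms:
        if canon not in embeddings:
            continue  # no vector for this canonical form, nothing to compare
        for word, vec in embeddings.items():
            if word in canonical_forms:
                continue  # never remap a canonical form
            similar_context = cosine(embeddings[canon], vec) >= min_cosine
            similar_spelling = (
                SequenceMatcher(None, canon, word).ratio() >= min_string_sim
            )
            if similar_context and similar_spelling:
                mapping[word] = canon
    return mapping


def normalize(text, mapping):
    """Replace every known variant in a whitespace-tokenized text."""
    return " ".join(mapping.get(tok, tok) for tok in text.lower().split())


NORMALIZATION_DICT = build_normalization_dict(EMBEDDINGS, CANONICAL_FORMS)
```

With these toy inputs, `normalize("had video mezyan bzzaf", NORMALIZATION_DICT)` rewrites the two variant spellings and leaves the other tokens unchanged. Requiring both criteria is a design choice of this sketch: embedding similarity alone would also pull in unrelated words used in similar contexts, while string similarity alone would confuse distinct words with similar spellings.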