Paper Title
Language Agnostic Code-Mixing Data Augmentation by Predicting Linguistic Patterns
Paper Authors
Paper Abstract
In this work, we focus on intrasentential code-mixing and propose several different Synthetic Code-Mixing (SCM) data augmentation methods that outperform the baseline on downstream sentiment analysis tasks across various amounts of labeled gold data. Most importantly, our proposed methods demonstrate that strategically replacing parts of sentences in the matrix language with a constant mask significantly improves classification accuracy, motivating further linguistic insights into the phenomenon of code-mixing. We test our data augmentation method in a variety of low-resource and cross-lingual settings, reaching up to a relative improvement of 7.73% on the extremely scarce English-Malayalam dataset. We conclude that the code-switch pattern in code-mixing sentences is also important for the model to learn. Finally, we propose a language-agnostic SCM algorithm that is cheap yet extremely helpful for low-resource languages.
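The masking idea in the abstract, replacing parts of a matrix-language sentence with a constant mask, can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not say how the replaced positions are chosen, so the sketch assumes uniform random selection over whitespace tokens. The names `MASK`, `mask_augment`, `augment_dataset`, and the `ratio` parameter are hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical constant mask token; the abstract does not name one


def mask_augment(sentence, ratio=0.3, seed=None):
    """Return `sentence` with roughly `ratio` of its tokens replaced by MASK.

    Assumption: positions are picked uniformly at random. The paper selects
    them "strategically", which this sketch does not attempt to reproduce.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    if not tokens:
        return sentence
    # Mask at least one token and at most all of them.
    k = min(len(tokens), max(1, round(ratio * len(tokens))))
    positions = set(rng.sample(range(len(tokens)), k))
    return " ".join(MASK if i in positions else tok for i, tok in enumerate(tokens))


def augment_dataset(examples, copies=1, ratio=0.3, seed=0):
    """Append `copies` masked variants of each (text, label) pair, keeping the label."""
    rng = random.Random(seed)
    augmented = list(examples)  # the original gold examples come first
    for text, label in examples:
        for _ in range(copies):
            variant = mask_augment(text, ratio, seed=rng.randrange(2**32))
            augmented.append((variant, label))
    return augmented
```

Calling `augment_dataset([("the movie was really good", "pos")], copies=2)` would return the original pair followed by two masked variants carrying the same sentiment label, which could then be added to the training set of a downstream classifier.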