Paper Title

Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùbá

Authors

David Ifeoluwa Adelani, Michael A. Hedderich, Dawei Zhu, Esther van den Berg, Dietrich Klakow

Abstract

The lack of labeled training data has limited the development of natural language processing tools, such as named entity recognition, for many languages spoken in developing countries. Techniques such as distant and weak supervision can be used to create labeled data in a (semi-) automatic way. Additionally, to alleviate some of the negative effects of the errors in automatic annotation, noise-handling methods can be integrated. Pretrained word embeddings are another key component of most neural named entity classifiers. With the advent of more complex contextual word embeddings, an interesting trade-off between model size and performance arises. While these techniques have been shown to work well in high-resource settings, we want to study how they perform in low-resource scenarios. In this work, we perform named entity recognition for Hausa and Yorùbá, two languages that are widely spoken in several developing countries. We evaluate different embedding approaches and show that distant supervision can be successfully leveraged in a realistic low-resource scenario where it can more than double a classifier's performance.
