Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages

Paper Title

Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages

Authors

Arnav Mhaske, Harshit Kedia, Sumanth Doddapaneni, Mitesh M. Khapra, Pratyush Kumar, Rudra Murthy V, Anoop Kunchukuttan

Abstract

We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language translation. We also create manually annotated test sets for 9 languages. We demonstrate the utility of the obtained dataset on the Naamapadam-test dataset. We also release IndicNER, a multilingual IndicBERT model fine-tuned on the Naamapadam training set. IndicNER achieves an F1 score of more than 80 for 7 out of 9 test languages. The dataset and models are available under open-source licences at https://ai4bharat.iitm.ac.in/naamapadam.
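
The abstract describes the training data as built by projecting entity tags from an English sentence onto its Indian language translation. The abstract does not say how the words are aligned or how the projection is carried out, so the following is only a minimal sketch under stated assumptions: the English side is already tagged in BIO format, a word alignment is supplied by some external aligner, and the helper name `project_entities` is hypothetical rather than taken from the paper or its released code.

```python
def project_entities(src_tags, alignment, tgt_len):
    """Project BIO tags from source tokens onto target tokens.

    src_tags:  BIO tags of the English tokens, e.g. ["B-PER", "I-PER", "O"]
    alignment: (src_index, tgt_index) pairs from a word aligner (assumed given)
    tgt_len:   number of tokens in the target sentence
    """
    # Step 1: copy the entity type of each tagged source token
    # onto every target token it is aligned to.
    tgt_types = [None] * tgt_len
    for src_idx, tgt_idx in alignment:
        tag = src_tags[src_idx]
        if tag != "O":
            tgt_types[tgt_idx] = tag.split("-", 1)[1]

    # Step 2: rebuild BIO tags on the target side, treating each
    # contiguous run of the same type as one entity (a simplification:
    # two adjacent entities of the same type would be merged).
    tgt_tags = []
    prev = None
    for ent_type in tgt_types:
        if ent_type is None:
            tgt_tags.append("O")
        elif ent_type == prev:
            tgt_tags.append("I-" + ent_type)
        else:
            tgt_tags.append("B-" + ent_type)
        prev = ent_type
    return tgt_tags


# Illustrative example (the sentence pair and alignment are made up):
# English: Sachin Tendulkar lives in Mumbai
# Hindi:   सचिन तेंदुलकर मुंबई में रहते हैं
english_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
alignment = [(0, 0), (1, 1), (4, 2), (3, 3), (2, 4)]
print(project_entities(english_tags, alignment, tgt_len=6))
# ['B-PER', 'I-PER', 'B-LOC', 'O', 'O', 'O']
```

Rebuilding the BIO tags on the target side, instead of copying `B-`/`I-` prefixes directly, keeps the output well-formed when word order differs between the two languages. Unaligned target tokens simply stay `O`.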
