Paper Title

Aksharantar: Open Indic-language Transliteration datasets and models for the Next Billion Users

Paper Authors

Yash Madhani, Sushane Parthan, Priyanka Bedekar, Gokul NC, Ruchi Khapra, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

Paper Abstract

Transliteration is very important in the Indian language context due to the use of multiple scripts and the widespread use of romanized input. However, few training and evaluation sets are publicly available. We introduce Aksharantar, the largest publicly available transliteration dataset for Indian languages, created by mining monolingual and parallel corpora and by collecting data from human annotators. The dataset contains 26 million transliteration pairs for 21 Indic languages from 3 language families using 12 scripts. Aksharantar is 21 times larger than existing datasets and is the first publicly available dataset for 7 languages and 1 language family. We also introduce the Aksharantar test set, comprising 103k word pairs spanning 19 languages, which enables a fine-grained analysis of transliteration models on native-origin words, foreign words, frequent words, and rare words. Using the training set, we train IndicXlit, a multilingual transliteration model that improves accuracy by 15% on the Dakshina test set and establishes strong baselines on the Aksharantar test set introduced in this work. The models, mining scripts, transliteration guidelines, and datasets are available at https://github.com/AI4Bharat/IndicXlit under open-source licenses. We hope the availability of these large-scale, open resources will spur innovation for Indic language transliteration and downstream applications.
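The 15% improvement the abstract reports is measured as word-level top-1 accuracy: the fraction of test words whose highest-ranked transliteration exactly matches the reference. As a minimal sketch of that metric (the function name and the toy romanized-to-Devanagari pairs below are illustrative assumptions, not the paper's actual evaluation tooling):

```python
# Hypothetical sketch of word-level top-1 transliteration accuracy,
# the kind of metric used to compare models on test sets like Dakshina.
# Pair data and names are illustrative only.

def top1_accuracy(predictions, references):
    """Fraction of words whose top prediction exactly matches the reference."""
    assert len(predictions) == len(references)
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return correct / len(references)

# Toy romanized -> Devanagari reference pairs (illustrative).
refs = ["नमस्ते", "भारत", "पानी"]
# A model's top-1 outputs; the last one is wrong.
preds = ["नमस्ते", "भारत", "पानि"]

print(round(top1_accuracy(preds, refs), 2))  # 0.67
```

In practice the comparison is done on the reference script side after normalization; exact-match accuracy is strict, which is why the fine-grained splits (native-origin, foreign, frequent, rare words) in the Aksharantar test set are useful for diagnosing where a model loses points.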
