Paper Title
IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages
Paper Authors
Paper Abstract
Natural Language Generation (NLG) for non-English languages is hampered by the scarcity of datasets in these languages. In this paper, we present the IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic languages. We focus on five diverse tasks, namely, biography generation using Wikipedia infoboxes, news headline generation, sentence summarization, paraphrase generation, and question generation. We describe the created datasets and use them to benchmark the performance of several monolingual and multilingual baselines that leverage pre-trained sequence-to-sequence models. Our results exhibit the strong performance of multilingual language-specific pre-trained models, and the utility of models trained on our datasets for other related NLG tasks. Our dataset creation methods can be easily applied to modest-resource languages as they involve simple steps such as scraping news articles and Wikipedia infoboxes, light cleaning, and pivoting through machine translation data. To the best of our knowledge, the IndicNLG Benchmark is the first NLG benchmark for Indic languages and the most diverse multilingual NLG dataset, with approximately 8M examples across 5 tasks and 11 languages. The datasets and models are publicly available at https://ai4bharat.iitm.ac.in/indicnlg-suite.
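The abstract mentions "pivoting through machine translation data" as one of the dataset creation steps. The following is a minimal illustrative sketch of that idea, not the authors' released pipeline: sentence pairs from a parallel English-Indic corpus that share the same English side are grouped by that English pivot, and the distinct Indic sentences within a group are paired as candidate paraphrases. The input format and function names here are assumptions introduced for illustration only.

```python
# Hypothetical sketch of paraphrase mining via an English pivot.
# Assumes the parallel corpus is available as (english, indic) string pairs;
# this is NOT the IndicNLG authors' code, only an illustration of the idea.
from collections import defaultdict
from itertools import combinations


def mine_paraphrases(parallel_pairs):
    """Group (english, indic) pairs by the English pivot and emit Indic paraphrase pairs."""
    by_pivot = defaultdict(set)
    for english, indic in parallel_pairs:
        by_pivot[english.strip()].add(indic.strip())

    paraphrases = []
    for indic_sentences in by_pivot.values():
        if len(indic_sentences) < 2:
            continue  # a pivot with a single translation yields no paraphrase pair
        for src, tgt in combinations(sorted(indic_sentences), 2):
            paraphrases.append((src, tgt))
    return paraphrases


if __name__ == "__main__":
    # Toy example: two Hindi translations of the same English sentence
    # become one candidate paraphrase pair; the singleton pivot yields none.
    pairs = [
        ("How are you?", "आप कैसे हैं?"),
        ("How are you?", "तुम कैसे हो?"),
        ("Good morning.", "सुप्रभात।"),
    ]
    for src, tgt in mine_paraphrases(pairs):
        print(src, "|||", tgt)
```

In practice such mined pairs would still need the "light cleaning" the abstract mentions (e.g., length and near-duplicate filtering) before being used as paraphrase generation training data.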