Universal Dependency Treebank for Odia Language

Paper Title

Universal Dependency Treebank for Odia Language

Authors

Shantipriya Parida, Kalyanamalini Sahoo, Atul Kr. Ojha, Saraswati Sahoo, Satya Ranjan Dash, Bijayalaxmi Dash

Abstract

This paper presents the first publicly available treebank of Odia, a morphologically rich, low-resource Indian language. The treebank contains approximately 1,082 tokens (100 sentences) in Odia, selected from "Samantar", the largest available collection of parallel corpora for Indic languages. All the selected sentences are manually annotated following the "Universal Dependency (UD)" guidelines. The morphological analysis of the Odia treebank was performed using machine learning techniques. The annotated treebank will enrich Odia language resources and help in building language technology tools for cross-lingual learning and typological research. We also build a preliminary Odia parser using a machine learning approach. The parser's accuracy is 86.6% for tokenization, 64.1% UPOS, 63.78% XPOS, 42.04% UAS, and 21.34% LAS. Finally, the paper briefly discusses the linguistic analysis of the Odia UD treebank.
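
For readers unfamiliar with the UAS and LAS figures quoted in the abstract, the following is a minimal sketch of how these two attachment scores are commonly computed. It is not the authors' evaluation code. It assumes gold and predicted tokens are already aligned one to one (identical tokenization), and the function name `attachment_scores` and the toy sentence are hypothetical.

```python
from typing import List, Tuple

# One arc per token: (index of the head token, dependency relation label).
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages.

    Assumption: gold and pred describe the same tokens in the same order.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted token counts differ")
    if not gold:
        raise ValueError("no tokens to score")

    # UAS counts tokens whose predicted head matches the gold head.
    uas_hits = sum(1 for (gh, _), (ph, _) in zip(gold, pred) if gh == ph)
    # LAS additionally requires the relation label to match.
    las_hits = sum(1 for g, p in zip(gold, pred) if g == p)

    n = len(gold)
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


# Hypothetical four-token sentence, for illustration only.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}%, LAS = {las:.2f}%")
```

Because LAS requires both the head and the label to be correct, it can never exceed UAS, which is consistent with the gap between the 42.04% UAS and 21.34% LAS reported above.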
