A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration

Paper Title

A Novel Challenge Set for Hebrew Morphological Disambiguation and Diacritics Restoration

Authors

Avi Shmidman, Joshua Guedalia, Shaltiel Shmidman, Moshe Koppel, Reut Tsarfaty

Abstract

One of the primary tasks of morphological parsers is the disambiguation of homographs. Particularly difficult are cases of unbalanced ambiguity, where one of the possible analyses is far more frequent than the others. In such cases, there may not exist sufficient examples of the minority analyses in order to properly evaluate performance, nor to train effective classifiers. In this paper we address the issue of unbalanced morphological ambiguities in Hebrew. We offer a challenge set for Hebrew homographs -- the first of its kind -- containing substantial attestation of each analysis of 21 Hebrew homographs. We show that the current SOTA of Hebrew disambiguation performs poorly on cases of unbalanced ambiguity. Leveraging our new dataset, we achieve a new state-of-the-art for all 21 words, improving the overall average F1 score from 0.67 to 0.95. Our resulting annotated datasets are made publicly available for further research.
