Language Identification for Austronesian Languages

Paper Title

Language Identification for Austronesian Languages

Authors

Jonathan Dunn, Wikke Nijhof

Abstract

This paper provides language identification models for low- and under-resourced languages in the Pacific region with a focus on previously unavailable Austronesian languages. Accurate language identification is an important part of developing language resources. The approach taken in this paper combines 29 Austronesian languages with 171 non-Austronesian languages to create an evaluation set drawn from eight data sources. After evaluating six approaches to language identification, we find that a classifier based on skip-gram embeddings reaches a significantly higher performance than alternate methods. We then systematically increase the number of non-Austronesian languages in the model up to a total of 800 languages to evaluate whether an increased language inventory leads to less precise predictions for the Austronesian languages of interest. This evaluation finds that there is only a minimal impact on accuracy caused by increasing the inventory of non-Austronesian languages. Further experiments adapt these language identification models for code-switching detection, achieving high accuracy across all 29 languages.
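To make the task concrete, the following is a minimal sketch of language identification, not the classifier evaluated in the paper. The abstract reports that a classifier based on skip-gram embeddings performed best; this sketch swaps in a much simpler technique, character trigram profiles compared by cosine similarity, purely for illustration. The function names (`char_ngrams`, `train`, `identify`), the two ISO 639-3 labels, and the handful of toy training phrases are all assumptions and are not drawn from the paper's data.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, padding with spaces at both ends."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def train(samples):
    """Build one n-gram count profile per language label."""
    profiles = {}
    for label, text in samples:
        profiles.setdefault(label, Counter()).update(char_ngrams(text))
    return profiles


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text, profiles):
    """Return the label whose profile is most similar to the input text."""
    grams = char_ngrams(text)
    return max(profiles, key=lambda label: cosine(grams, profiles[label]))


# Toy training data (an assumption for illustration): a few Maori and
# English phrases, labelled with ISO 639-3 codes.
samples = [
    ("mri", "kia ora koutou katoa"),
    ("mri", "kei te pēhea koe"),
    ("mri", "ka kite anō"),
    ("eng", "hello everyone"),
    ("eng", "how are you today"),
    ("eng", "see you again soon"),
]
profiles = train(samples)

print(identify("kei te pai", profiles))
print(identify("how are you", profiles))
```

Under the same assumptions, applying `identify` to short overlapping windows of a sentence and checking whether the predicted label changes would give a rough form of the code-switching detection mentioned at the end of the abstract.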
