Paper title
Computational historical linguistics and language diversity in South Asia
Paper authors
Abstract
South Asia is home to a plethora of languages, many of which severely lack access to new language technologies. This linguistic diversity also results in a research environment conducive to the study of comparative, contact, and historical linguistics -- fields which necessitate the gathering of extensive data from many languages. We claim that data scatteredness (rather than scarcity) is the primary obstacle in the development of South Asian language technology, and suggest that the study of language history is uniquely aligned with surmounting this obstacle. We review recent developments in and at the intersection of South Asian NLP and historical-comparative linguistics, describing our and others' current efforts in this area. We also offer new strategies towards breaking the data barrier.