Paper Title
COVID-19-related Nepali Tweets Classification in a Low Resource Setting
Authors
Abstract
Billions of people across the globe have been using social media platforms in their local languages to voice their opinions on topics related to the COVID-19 pandemic. Several organizations, including the World Health Organization, have developed automated social media analysis tools that classify COVID-19-related tweets into various topics. However, these tools, which help combat the pandemic, support very few languages, leaving several countries unable to benefit from them. While multilingual and low-resource language-specific tools are being developed, their coverage still needs to expand to languages such as Nepali. In this paper, we identify the eight most common COVID-19 discussion topics among the Nepali-language Twitter community, set up an online platform that automatically gathers Nepali tweets containing COVID-19-related keywords, classify the tweets into the eight topics, and visualize the results over time in a web-based dashboard. We compare the performance of two state-of-the-art multilingual language models for Nepali tweet classification: one generic (mBERT) and the other specific to the Nepali language family (MuRIL). Our results show that the models' relative performance depends on the data size, with MuRIL performing better on a larger dataset. The annotated data, models, and the web-based dashboard are open-sourced at https://github.com/naamiinepal/covid-tweet-classification.
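The gather, classify, and visualize steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the keyword list, the topic names, the function names, and the rule-based `toy_classifier` are all hypothetical stand-ins, and the sketch assumes a classifier may assign more than one topic to a tweet. In the actual system the classifier would be a fine-tuned mBERT or MuRIL model.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical keyword list; the keywords used in the paper are not given here.
COVID_KEYWORDS = ("कोभिड", "कोरोना", "covid", "corona")


def is_covid_related(text, keywords=COVID_KEYWORDS):
    """Return True if the tweet text contains any of the keywords."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def toy_classifier(text):
    """Hypothetical rule-based stand-in for a fine-tuned mBERT/MuRIL model.

    Returns a list of topic labels; the label names are illustrative only.
    """
    if "खोप" in text or "vaccine" in text.lower():
        return ["vaccine"]
    return ["other"]


def topic_counts_by_day(tweets, classify):
    """Aggregate per-day topic counts, the kind of table a dashboard could plot.

    tweets: iterable of (date, text) pairs.
    classify: callable mapping text to an iterable of topic labels.
    """
    counts = defaultdict(Counter)
    for day, text in tweets:
        if not is_covid_related(text):
            continue  # keep only tweets that match a COVID-19 keyword
        for topic in classify(text):
            counts[day][topic] += 1
    return counts


# Illustrative input; these tweets are made up for the example.
sample_tweets = [
    (date(2021, 5, 1), "कोरोना खोप आयो"),
    (date(2021, 5, 1), "आज मौसम राम्रो छ"),
    (date(2021, 5, 2), "COVID vaccine update"),
]
daily_counts = topic_counts_by_day(sample_tweets, toy_classifier)
```

Passing the classifier in as a callable keeps the filtering and aggregation logic independent of the model, so the same code would work whether the labels come from mBERT, MuRIL, or a simple baseline.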