Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text

Paper Title

Corpus Creation for Sentiment Analysis in Code-Mixed Tamil-English Text

Authors

Chakravarthi, Bharathi Raja; Muralidaran, Vigneshwaran; Priyadharshini, Ruba; McCrae, John P.

Abstract

Understanding the sentiment of a comment from a video or an image is an essential task in many applications. Sentiment analysis of a text can be useful for various decision-making processes. One such application is to analyse the popular sentiments of videos on social media based on viewer comments. However, comments from social media do not follow strict rules of grammar, and they contain mixing of more than one language, often written in non-native scripts. Non-availability of annotated code-mixed data for a low-resourced language like Tamil also adds difficulty to this problem. To overcome this, we created a gold standard Tamil-English code-switched, sentiment-annotated corpus containing 15,744 comment posts from YouTube. In this paper, we describe the process of creating the corpus and assigning polarities. We present inter-annotator agreement and show the results of sentiment analysis trained on this corpus as a benchmark.
