CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts

Paper Title

CoLI-Machine Learning Approaches for Code-mixed Language Identification at the Word Level in Kannada-English Texts

Authors

Shashirekha, H. L., Balouchzahi, F., Anusha, M. D., Sidorov, G.

Abstract

The task of automatically identifying the language used in a given text is called Language Identification (LI). India is a multilingual country, and many Indians, especially youths, are comfortable with Hindi and English in addition to their local languages. Hence, they often use more than one language when posting comments on social media. Texts containing more than one language are called "code-mixed texts" and are a good source of input for LI. Languages in these texts may be mixed at the sentence level, the word level, or even the sub-word level. LI at the word level is a sequence labeling problem in which every word in a sentence is tagged with one of the languages in a predefined set. To address word-level LI in code-mixed Kannada-English (Kn-En) texts, this work presents i) the construction of a code-mixed Kn-En dataset called the CoLI-Kenglish dataset, ii) a code-mixed Kn-En embedding, and iii) learning models using Machine Learning (ML), Deep Learning (DL), and Transfer Learning (TL) approaches. Code-mixed Kn-En texts are extracted from Kannada YouTube video comments to construct the CoLI-Kenglish dataset and the code-mixed Kn-En embedding. The words in the CoLI-Kenglish dataset are grouped into six major categories, namely "Kannada", "English", "Mixed-language", "Name", "Location", and "Other". The learning models, namely CoLI-vectors and CoLI-ngrams based on ML, CoLI-BiLSTM based on DL, and CoLI-ULMFiT based on TL, are built and evaluated using the CoLI-Kenglish dataset. The results show that the CoLI-ngrams model, with a macro-average F1-score of 0.64, outperforms the other models. However, the results of all the learning models were quite competitive with each other.
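To make the word-level framing concrete, the following is a minimal sketch of tagging each word of a sentence with a language label from character n-gram features. It is not the authors' CoLI-ngrams implementation: the abstract does not specify the classifier or feature settings, so a small add-one-smoothed naive Bayes model is swapped in, and the class name `NgramWordTagger`, the n-gram range of 1 to 3, and the eight training words are all hypothetical choices made for illustration. Only two of the six dataset categories are used.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n_min=1, n_max=3):
    """Return character n-grams of a word, with < and > marking its boundaries."""
    padded = f"<{word.lower()}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramWordTagger:
    """Multinomial naive Bayes over character n-grams (illustrative stand-in)."""

    def __init__(self):
        self.gram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter()            # label -> number of training words
        self.vocab = set()                       # all n-grams seen in training

    def fit(self, words, labels):
        for word, label in zip(words, labels):
            grams = char_ngrams(word)
            self.gram_counts[label].update(grams)
            self.label_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict_word(self, word):
        total_words = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_words in self.label_counts.items():
            # Add-one smoothing keeps unseen n-grams from zeroing the score.
            denom = sum(self.gram_counts[label].values()) + len(self.vocab)
            score = math.log(n_words / total_words)
            for gram in char_ngrams(word):
                score += math.log((self.gram_counts[label][gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def tag(self, sentence):
        """Tag every whitespace-separated word independently."""
        return [(w, self.predict_word(w)) for w in sentence.split()]


# Toy, made-up training words (romanized); NOT taken from the CoLI-Kenglish dataset.
train = [
    ("naanu", "Kannada"), ("illa", "Kannada"), ("beku", "Kannada"), ("maadi", "Kannada"),
    ("video", "English"), ("super", "English"), ("song", "English"), ("nice", "English"),
]
words, labels = zip(*train)
tagger = NgramWordTagger().fit(words, labels)
tagged = tagger.tag("video super illa")  # list of (word, label) pairs
```

Each word is classified on its own here, which ignores sentence context; a sequence model such as the BiLSTM named in the abstract would instead condition each label on neighboring words.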
