Data Augmentation for Improving Emotion Recognition in Software Engineering Communication

Paper Title

Data Augmentation for Improving Emotion Recognition in Software Engineering Communication

Authors

Mia Mohammad Imran, Yashasvi Jain, Preetha Chatterjee, Kostadin Damevski

Abstract

Emotions (e.g., Joy, Anger) are prevalent in daily software engineering (SE) activities and are known to be significant indicators of work productivity (e.g., bug-fixing efficiency). Recent studies have shown that directly applying general-purpose emotion classification tools to SE corpora is not effective. Even within the SE domain, tool performance degrades significantly when a tool is trained on one communication channel and evaluated on another (e.g., StackOverflow vs. GitHub comments). Retraining a tool with channel-specific data takes significant effort, since manually annotating large ground-truth datasets is expensive. In this paper, we address this data scarcity problem by automatically creating new training data using a data augmentation technique. Based on an analysis of the types of errors made by popular SE-specific emotion recognition tools, we specifically target our data augmentation strategy to improve the performance of emotion recognition. Our results show an average improvement of 9.3% in micro F1-score for three existing emotion classification tools (ESEM-E, EMTk, SEntiMoji) when trained with our best augmentation strategy.
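
The abstract reports its results as micro F1-score. As a reminder of what that metric measures, here is a minimal sketch of the standard micro-averaged F1 computation for multi-label emotion predictions. The function name `micro_f1`, the set-based label representation, and the sample labels are illustrative assumptions; this is not the evaluation code of the paper or of the tools it studies.

```python
from typing import Sequence, Set


def micro_f1(gold: Sequence[Set[str]], pred: Sequence[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives
    over all emotion labels and all texts, then compute a single F1."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and actually present
        fp += len(p - g)   # labels predicted but not present
        fn += len(g - p)   # labels present but not predicted

    denom = 2 * tp + fp + fn
    # With no positive labels and no predictions, return 0.0 by convention.
    return 2 * tp / denom if denom else 0.0


# Hypothetical annotations for four comments (illustrative only).
gold_labels = [{"Joy"}, {"Anger"}, {"Joy", "Surprise"}, set()]
pred_labels = [{"Joy"}, {"Joy"}, {"Joy"}, {"Anger"}]
score = micro_f1(gold_labels, pred_labels)
```

Because the counts are pooled before the ratio is taken, frequent emotion labels weigh more heavily in micro F1 than rare ones, which is worth keeping in mind when comparing scores across datasets with different label distributions.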
