Paper Title

EPiDA: An Easy Plug-in Data Augmentation Framework for High Performance Text Classification

Authors

Minyi Zhao, Lu Zhang, Yi Xu, Jiandong Ding, Jihong Guan, Shuigeng Zhou

Abstract

Recent works have empirically shown the effectiveness of data augmentation (DA) in NLP tasks, especially for those suffering from data scarcity. Intuitively, for a given amount of generated data, their diversity and quality are crucial to the performance of the target tasks. However, to the best of our knowledge, most existing methods consider only either the diversity or the quality of augmented data, and thus cannot fully exploit the potential of DA for NLP. In this paper, we present EPiDA, an easy plug-in data augmentation framework to support effective text classification. EPiDA employs two mechanisms, relative entropy maximization (REM) and conditional entropy minimization (CEM), to control data generation: REM is designed to enhance the diversity of the augmented data, while CEM is exploited to ensure their semantic consistency. EPiDA supports efficient and continuous data generation for effective classifier training. Extensive experiments show that EPiDA outperforms existing SOTA methods in most cases, even though it does not use any agent networks or pre-trained generation networks, and it works well with various DA algorithms and classification models. Code is available at https://github.com/zhaominyiz/EPiDA.
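
The abstract only names the REM and CEM mechanisms without giving formulas, so the following is a minimal, hypothetical sketch of how a relative-entropy (diversity) term and an entropy-based (quality/consistency) term could be combined to rank candidate augmentations, assuming access to a classifier's predicted label distributions. The function names, the weighting scheme, and the toy numbers are illustrative assumptions, not the actual EPiDA implementation (see the linked repository for that).

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        # Relative entropy KL(p || q) between two discrete distributions.
        p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
        return float(np.sum(p * np.log(p / q)))

    def entropy(p, eps=1e-12):
        # Shannon entropy H(p) of a discrete distribution.
        p = np.clip(p, eps, 1.0)
        return float(-np.sum(p * np.log(p)))

    def rank_augmentations(p_orig, p_augs, weight=0.5):
        # Combine a diversity term (relative entropy w.r.t. the original
        # sample's predicted distribution, to be maximized) with a quality
        # term (negative predictive entropy, i.e. prefer confident,
        # label-consistent predictions). Returns candidate indices, best
        # first. The 50/50 weighting is an arbitrary illustration.
        scores = [
            weight * kl_divergence(p_aug, p_orig) - (1.0 - weight) * entropy(p_aug)
            for p_aug in p_augs
        ]
        return np.argsort(scores)[::-1]

    # Toy example: three candidate augmentations of one training sample.
    p_orig = np.array([0.8, 0.1, 0.1])          # classifier output on the original text
    p_augs = np.array([
        [0.6, 0.3, 0.1],     # fairly diverse, still confident
        [0.34, 0.33, 0.33],  # diverse but semantically ambiguous
        [0.8, 0.1, 0.1],     # consistent but adds little diversity
    ])
    print(rank_augmentations(p_orig, p_augs))   # keep the top-ranked candidates

In this toy setup, a candidate that shifts the predicted distribution away from the original while remaining confidently classified ranks highest, mirroring the diversity/quality trade-off the abstract describes.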
