Paper Title
Understanding, Detecting, and Separating Out-of-Distribution Samples and Adversarial Samples in Text Classification

Paper Authors

Cheng-Han Chiang, Hung-yi Lee

Paper Abstract
In this paper, we study the differences and commonalities between statistically out-of-distribution (OOD) samples and adversarial (Adv) samples, both of which hurt a text classification model's performance. We conduct analyses to compare the two types of anomalies (OOD and Adv samples) with the in-distribution (ID) ones from three aspects: the input features, the hidden representations in each layer of the model, and the output probability distributions of the classifier. We find that OOD samples expose their aberration starting from the first layer, while the abnormalities of Adv samples do not emerge until the deeper layers of the model. We also illustrate that the model's output probabilities for Adv samples tend to be less confident. Based on our observations, we propose a simple method to separate ID, OOD, and Adv samples using the hidden representations and output probabilities of the model. On multiple combinations of ID datasets, OOD datasets, and Adv attacks, our proposed method shows exceptional results in distinguishing ID, OOD, and Adv samples.
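The abstract's two observations (OOD anomalies appear from the first layer; Adv anomalies appear only in deeper layers and come with lower output confidence) suggest a simple decision rule. The sketch below is an illustrative toy, not the authors' exact method: the distance-to-centroid scoring, the threshold values, and all function and variable names are assumptions made for the example.

```python
import numpy as np

def separate_sample(first_layer_feat, last_layer_feat, probs,
                    id_first_centroid, id_last_centroid,
                    first_thresh=5.0, last_thresh=5.0, conf_thresh=0.6):
    """Toy separation rule (hypothetical, not the paper's exact detector).

    - OOD samples look anomalous already at the first layer, so a large
      distance to the ID centroid there flags "OOD".
    - Adv samples only look anomalous in deep layers, and their output
      probabilities tend to be less confident, so either signal flags "Adv".
    - Everything else is treated as in-distribution ("ID").
    """
    first_dist = np.linalg.norm(first_layer_feat - id_first_centroid)
    last_dist = np.linalg.norm(last_layer_feat - id_last_centroid)
    confidence = float(np.max(probs))

    if first_dist > first_thresh:          # anomalous from the first layer
        return "OOD"
    if last_dist > last_thresh or confidence < conf_thresh:
        return "Adv"                       # deep-layer anomaly or low confidence
    return "ID"
```

In practice one would estimate the ID centroids (or a full class-conditional Gaussian, as in Mahalanobis-distance detectors) from held-out ID data and calibrate the thresholds for a target false-positive rate.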
