Paper Title

Fooling Explanations in Text Classifiers

Authors

Adam Ivankay, Ivan Girardi, Chiara Marchiori, Pascal Frossard

Abstract

State-of-the-art text classification models are becoming increasingly reliant on deep neural networks (DNNs). Due to their black-box nature, faithful and robust explanation methods need to accompany classifiers for deployment in real-life scenarios. However, it has been shown in vision applications that explanation methods are susceptible to local, imperceptible perturbations that can significantly alter the explanations without changing the predicted classes. We show here that the existence of such perturbations extends to text classifiers as well. Specifically, we introduce TextExplanationFooler (TEF), a novel explanation attack algorithm that alters text input samples imperceptibly so that the outcome of widely-used explanation methods changes considerably while leaving classifier predictions unchanged. We evaluate the attribution robustness estimation performance of TEF on five sequence classification datasets, utilizing three DNN architectures and three transformer architectures for each dataset. TEF can significantly decrease the correlation between unchanged and perturbed input attributions, which shows that all models and explanation methods are susceptible to TEF perturbations. Moreover, we evaluate how the perturbations transfer to other model architectures and attribution methods, and show that TEF perturbations are also effective in scenarios where the target model and explanation method are unknown. Finally, we introduce a semi-universal attack that is able to compute fast, computationally light perturbations with no knowledge of either the attacked classifier or the explanation method. Overall, our work shows that explanations in text classifiers are very fragile, and users need to carefully address their robustness before relying on them in critical applications.
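
The abstract measures attack success as the correlation between attributions computed on the original and on the imperceptibly perturbed input, under the constraint that the predicted class stays the same. The minimal sketch below illustrates that measurement only; it is not the authors' TEF algorithm, and the toy vocabulary, random linear classifier, and weight-based attribution rule are illustrative assumptions.

```python
# Illustrative sketch (not the TEF implementation from the paper): check whether a
# near-synonym word swap leaves a toy classifier's prediction unchanged while
# altering its token attributions, quantified by Pearson correlation.
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and random bag-of-words classifier weights (2 classes) -- stand-ins.
VOCAB = ["the", "movie", "was", "truly", "really", "great", "fantastic", "boring"]
W = rng.normal(size=(len(VOCAB), 2))


def predict_and_attribute(tokens):
    """Return (predicted class, per-token attribution) for the toy linear model.

    Each token's attribution is its weight toward the predicted class,
    a simple stand-in for gradient-based saliency."""
    ids = [VOCAB.index(t) for t in tokens]
    logits = W[ids].sum(axis=0)
    pred = int(np.argmax(logits))
    attributions = W[ids, pred]
    return pred, attributions


def attribution_correlation(a, b):
    """Pearson correlation between two attribution vectors of equal length."""
    return float(np.corrcoef(a, b)[0, 1])


original = ["the", "movie", "was", "truly", "great"]
perturbed = ["the", "movie", "was", "really", "great"]  # near-synonym substitution

pred_orig, attr_orig = predict_and_attribute(original)
pred_pert, attr_pert = predict_and_attribute(perturbed)

print("prediction unchanged:", pred_orig == pred_pert)
print("attribution correlation:", attribution_correlation(attr_orig, attr_pert))
```

An attack in the spirit of TEF would search over many such candidate substitutions and keep those that drive this correlation down while leaving the predicted class unchanged.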
