Paper Title
Concept-based Adversarial Attacks: Tricking Humans and Classifiers Alike
Paper Authors
Paper Abstract
We propose to generate adversarial samples by modifying activations of upper layers that encode semantically meaningful concepts. The original sample is shifted towards a target sample by reconstructing it from the modified activations, yielding an adversarial sample. A human might (and possibly should) notice differences between the original and the adversarial sample. Depending on the attacker-provided constraints, an adversarial sample can exhibit subtle differences or appear like a "forged" sample from another class. Our approach and goal stand in stark contrast to common attacks that perturb single pixels in ways humans cannot recognize. Our approach is relevant in, e.g., multi-stage processing of inputs, where both humans and machines are involved in decision-making, because invisible perturbations will not fool a human. Our evaluation focuses on deep neural networks. We also show the transferability of our adversarial examples among networks.
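The core idea, blending concept-level activations of the original and the target sample and reconstructing the input from the blend, can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: `Encoder`, `Decoder`, and the interpolation weight `alpha` are hypothetical placeholders for the upper layers of the attacked classifier and a reconstruction network assumed to invert them.

```python
# Illustrative sketch only: untrained placeholder networks stand in for the
# upper layers of the attacked classifier (Encoder) and a reconstruction
# network trained to invert them (Decoder).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Placeholder for the upper layers producing concept-level activations."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64),
                                 nn.ReLU(), nn.Linear(64, 16))
    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Placeholder for a network mapping activations back to input space."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                                 nn.Linear(64, 28 * 28), nn.Sigmoid())
    def forward(self, z):
        return self.net(z).view(-1, 1, 28, 28)

def concept_attack(encoder, decoder, x_orig, x_target, alpha=0.3):
    """Blend concept-level activations of original and target samples, then
    reconstruct an adversarial input from the blended activations.
    alpha controls how far the sample is shifted toward the target class."""
    with torch.no_grad():
        z_orig = encoder(x_orig)
        z_target = encoder(x_target)
        z_adv = (1.0 - alpha) * z_orig + alpha * z_target
        return decoder(z_adv)

# Usage with random "images"; real attacks would use trained networks.
encoder, decoder = Encoder(), Decoder()
x_orig = torch.rand(1, 1, 28, 28)    # sample from the original class
x_target = torch.rand(1, 1, 28, 28)  # sample from the attacker's target class
x_adv = concept_attack(encoder, decoder, x_orig, x_target, alpha=0.3)
print(x_adv.shape)  # torch.Size([1, 1, 28, 28])
```

With a small `alpha`, the reconstruction exhibits only subtle, concept-level differences from the original; with a large `alpha`, it approaches a "forged" sample of the target class, matching the attacker-provided constraints described in the abstract.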