Paper Title

Identifying a Training-Set Attack's Target Using Renormalized Influence Estimation

Paper Authors

Zayd Hammoudeh, Daniel Lowd

Paper Abstract

Targeted training-set attacks inject malicious instances into the training set to cause a trained model to mislabel one or more specific test instances. This work proposes the task of target identification, which determines whether a specific test instance is the target of a training-set attack. Target identification can be combined with adversarial-instance identification to find (and remove) the attack instances, mitigating the attack with minimal impact on other predictions. Rather than focusing on a single attack method or data modality, we build on influence estimation, which quantifies each training instance's contribution to a model's prediction. We show that existing influence estimators' poor practical performance often derives from their over-reliance on training instances and iterations with large losses. Our renormalized influence estimators fix this weakness; they far outperform the original estimators at identifying influential groups of training examples in both adversarial and non-adversarial settings, even finding up to 100% of adversarial training instances with no clean-data false positives. Target identification then simplifies to detecting test instances with anomalous influence values. We demonstrate our method's effectiveness on backdoor and poisoning attacks across various data domains, including text, vision, and speech, as well as against a gray-box, adaptive attacker that specifically optimizes the adversarial instances to evade our method. Our source code is available at https://github.com/ZaydH/target_identification.
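
The abstract does not spell out the estimator itself, so the following is only a minimal sketch of the idea it describes: a TracIn-style, checkpoint-based influence estimate in PyTorch, where "renormalization" is approximated by dividing each training-gradient dot product by that gradient's norm so that instances and iterations with large losses do not dominate. The function name, arguments, and the exact normalization here are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch


def renormalized_influence(model, loss_fn, checkpoints, learning_rates,
                           train_examples, test_example):
    """Sketch: TracIn-style influence of each training example on one test example,
    with each per-checkpoint contribution renormalized by the training gradient's norm.

    checkpoints:    list of model state_dicts saved during training (assumed available)
    learning_rates: learning rate used at each corresponding checkpoint
    train_examples: list of (x, y) tensor pairs
    test_example:   a single (x, y) tensor pair
    """
    x_test, y_test = test_example
    scores = torch.zeros(len(train_examples))

    for state_dict, lr in zip(checkpoints, learning_rates):
        model.load_state_dict(state_dict)

        # Gradient of the test loss at this checkpoint.
        model.zero_grad()
        loss_fn(model(x_test.unsqueeze(0)), y_test.unsqueeze(0)).backward()
        g_test = torch.cat([p.grad.flatten() for p in model.parameters()])

        for i, (x, y) in enumerate(train_examples):
            # Gradient of the training loss for this instance at the same checkpoint.
            model.zero_grad()
            loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
            g_train = torch.cat([p.grad.flatten() for p in model.parameters()])

            # Renormalized contribution: divide out the training gradient's norm so
            # large-loss instances/iterations do not dominate (illustrative choice).
            scores[i] += lr * (g_train @ g_test) / (g_train.norm() + 1e-12)

    return scores
```

Under this sketch, target identification would then amount to computing such influence profiles for test instances and flagging those whose (renormalized) influence values are anomalous, i.e., dominated by a small group of training instances.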
