Paper Title
Certifying Safety in Reinforcement Learning under Adversarial Perturbation Attacks
Paper Authors
Paper Abstract
Function approximation has enabled remarkable advances in applying reinforcement learning (RL) techniques in environments with high-dimensional inputs, such as images, in an end-to-end fashion, mapping such inputs directly to low-level control. Nevertheless, such end-to-end policies have proved vulnerable to small adversarial input perturbations. A number of approaches for improving or certifying robustness of end-to-end RL to adversarial perturbations have emerged as a result, focusing on cumulative reward. However, what is often at stake in adversarial scenarios is the violation of fundamental properties, such as safety, rather than the overall reward that combines safety with efficiency. Moreover, properties such as safety can only be defined with respect to true state, rather than the high-dimensional raw inputs to end-to-end policies. To disentangle nominal efficiency and adversarial safety, we situate RL in deterministic partially-observable Markov decision processes (POMDPs) with the goal of maximizing cumulative reward subject to safety constraints. We then propose a partially-supervised reinforcement learning (PSRL) framework that takes advantage of an additional assumption that the true state of the POMDP is known at training time. We present the first approach for certifying safety of PSRL policies under adversarial input perturbations, and two adversarial training approaches that make direct use of PSRL. Our experiments demonstrate both the efficacy of the proposed approach for certifying safety in adversarial environments and the value of the PSRL framework coupled with adversarial training in improving certified safety while preserving high nominal reward and high-quality predictions of true state.
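
To make the setup concrete, the minimal sketch below illustrates one way the PSRL decomposition described in the abstract could look: a toy deterministic POMDP whose low-dimensional true state is embedded into a higher-dimensional observation, a state predictor fit with supervision under the assumption that true states are available at training time, and a simple policy that acts on the predicted state while keeping predicted successor states outside an unsafe set (reward maximization subject to a safety constraint defined on true state). All names, dimensions, dynamics, and the unsafe set are illustrative placeholders, not the paper's implementation; certification under adversarial perturbations and adversarial training are not shown.

import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, OBS_DIM = 2, 32
EMBED = rng.normal(size=(OBS_DIM, STATE_DIM))          # fixed "rendering" of the state into a high-dim observation
ACTIONS = np.array([[0.1, 0.0], [-0.1, 0.0], [0.0, 0.1], [0.0, -0.1]])
UNSAFE_LO, UNSAFE_HI = np.array([0.35, 0.35]), np.array([0.65, 0.65])  # unsafe box in true-state space

def observe(state):
    # High-dimensional observation of the low-dimensional true state.
    return EMBED @ state

def step(state, action):
    # Deterministic dynamics; reward encourages reaching the origin.
    next_state = state + action
    return next_state, -np.linalg.norm(next_state)

def is_safe(state):
    # Safety is defined on the true state, not on raw observations.
    return not np.all((state >= UNSAFE_LO) & (state <= UNSAFE_HI))

# Partial supervision: true states are assumed known at training time,
# so the state predictor is fit on (observation, true state) pairs.
train_states = rng.uniform(-1.0, 1.0, size=(500, STATE_DIM))
train_obs = train_states @ EMBED.T
W, *_ = np.linalg.lstsq(train_obs, train_states, rcond=None)  # least-squares predictor: s_hat = obs @ W

def predict_state(obs):
    return obs @ W

def policy(obs):
    # Greedy one-step policy defined on the *predicted* state,
    # restricted to actions whose predicted successor is safe.
    s_hat = predict_state(obs)
    candidates = [(a, *step(s_hat, a)) for a in ACTIONS]
    safe = [(a, r) for a, s_next, r in candidates if is_safe(s_next)]
    return max(safe, key=lambda x: x[1])[0] if safe else ACTIONS[0]

# Nominal rollout: at test time the agent only sees observations;
# the true state is used here solely to check the safety constraint.
state = np.array([1.0, 1.0])
for _ in range(20):
    action = policy(observe(state))
    state, _ = step(state, action)
    assert is_safe(state), "safety constraint violated"
print("final state:", np.round(state, 3))

In this toy instance the predictor is essentially exact, so acting on the predicted state preserves safety; the paper's question is what happens when the observation is adversarially perturbed, where the gap between predicted and true state is exactly what a safety certificate must bound.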