Paper Title
Towards Safe Reinforcement Learning with a Safety Editor Policy
Paper Authors
Paper Abstract
We consider the safe reinforcement learning (RL) problem of maximizing utility with extremely low constraint violation rates. Assuming no prior knowledge or pre-training of the environment safety model given a task, an agent has to learn, via exploration, which states and actions are safe. A popular approach in this line of research is to combine a model-free RL algorithm with the Lagrangian method to adjust the weight of the constraint reward relative to the utility reward dynamically. It relies on a single policy to handle the conflict between utility and constraint rewards, which is often challenging. We present SEditor, a two-policy approach that learns a safety editor policy transforming potentially unsafe actions proposed by a utility maximizer policy into safe ones. The safety editor is trained to maximize the constraint reward while minimizing a hinge loss of the utility state-action values before and after an action is edited. SEditor extends existing safety layer designs that assume simplified safety models, to general safe RL scenarios where the safety model can in theory be arbitrarily complex. As a first-order method, it is easy to implement and efficient for both inference and training. On 12 Safety Gym tasks and 2 safe racing tasks, SEditor obtains a much higher overall safety-weighted-utility (SWU) score than the baselines, and demonstrates outstanding utility performance with constraint violation rates as low as once per 2k time steps, even in obstacle-dense environments. On some tasks, this low violation rate is up to 200 times lower than that of an unconstrained RL method with similar utility performance. Code is available at https://github.com/hnyu/seditor.
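To make the hinge objective described in the abstract concrete, the sketch below computes a penalty that is nonzero only when the edited action has a lower utility state-action value than the action originally proposed by the utility maximizer. This is a minimal illustration only; the names `q_utility`, `proposed_action`, and `edited_action` are assumptions for exposition and are not taken from the SEditor repository, whose exact loss formulation may differ.

```python
import torch

def utility_hinge_loss(q_utility, state, proposed_action, edited_action):
    """Hinge penalty on utility degradation caused by the safety editor.

    A sketch: the editor is penalized only when editing reduces the
    utility Q-value, i.e. loss = max(0, Q^U(s, a) - Q^U(s, a_edited)).

    q_utility: callable mapping (state, action) batches to Q-value tensors
               (a hypothetical stand-in for the utility critic).
    """
    q_before = q_utility(state, proposed_action)   # Q^U(s, a) for the raw action
    q_after = q_utility(state, edited_action)      # Q^U(s, a_edited) after editing
    # Hinge: no penalty if the edited action is at least as good for utility.
    return torch.relu(q_before - q_after).mean()
```

In training, this term would be combined with the constraint-reward objective for the safety editor, so the editor learns to keep actions safe while deviating from the utility maximizer's proposal as little as the utility critic can detect.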