Title
Guiding Safe Exploration with Weakest Preconditions
Authors
Abstract
In reinforcement learning for safety-critical settings, it is often desirable for the agent to obey safety constraints at all points in time, including during training. We present a novel neurosymbolic approach called SPICE to solve this safe exploration problem. SPICE uses an online shielding layer based on symbolic weakest preconditions to achieve a more precise safety analysis than existing tools without unduly impacting the training process. We evaluate the approach on a suite of continuous control benchmarks and show that it can achieve comparable performance to existing safe learning techniques while incurring fewer safety violations. Additionally, we present theoretical results showing that SPICE converges to the optimal safe policy under reasonable assumptions.
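
The abstract does not describe the shielding layer in detail. As a rough illustration only, the sketch below shows one way a weakest-precondition shield could work under strong simplifying assumptions that are not taken from the abstract: a known linear environment model x' = A x + B u and a single linear safety constraint p . x' <= q. Under those assumptions, the weakest precondition of the constraint with respect to the action is itself a linear constraint c . u <= d, and a proposed action that violates it can be projected onto the safe half-space. All names here (`weakest_precondition`, `shield`, `A`, `B`, `p`, `q`) are hypothetical and are not SPICE's API.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mat_vec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product."""
    return [dot(row, v) for row in m]


def weakest_precondition(
    A: Matrix, B: Matrix, p: Vector, q: float, x: Vector
) -> Tuple[Vector, float]:
    """Return (c, d) such that c . u <= d  iff  p . (A x + B u) <= q.

    Assumes a linear model x' = A x + B u (an illustrative assumption).
    """
    # p . (B u) = (B^T p) . u, so c is B^T p.
    c = [dot([B[i][j] for i in range(len(B))], p) for j in range(len(B[0]))]
    # Move the state-dependent term to the right-hand side.
    d = q - dot(p, mat_vec(A, x))
    return c, d


def shield(u: Vector, c: Vector, d: float) -> Vector:
    """Return u if it satisfies c . u <= d, else its Euclidean projection
    onto the half-space {u : c . u <= d}."""
    violation = dot(c, u) - d
    if violation <= 0.0:
        return list(u)  # proposed action is already safe under the model
    norm_sq = dot(c, c)
    if norm_sq == 0.0:
        # The action cannot influence the constraint in this state.
        raise ValueError("no action satisfies the constraint in this state")
    return [ui - (violation / norm_sq) * ci for ui, ci in zip(u, c)]


# Hypothetical example: state = [position, velocity], action = [acceleration],
# time step 0.1, safety constraint "next velocity <= 1".
A = [[1.0, 0.1], [0.0, 1.0]]
B = [[0.0], [0.1]]
p, q = [0.0, 1.0], 1.0
x = [0.0, 0.95]

c, d = weakest_precondition(A, B, p, q, x)
safe_action = shield([2.0], c, d)  # the policy's proposed action is 2.0
```

In this toy setting the proposed acceleration is cut back so that the predicted next velocity sits on the constraint boundary, while actions that already satisfy the precondition pass through unchanged. A real shield would need to handle multiple constraints, model error, and nonlinear dynamics, none of which this sketch addresses.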