Paper Title
Learning to be safe, in finite time
Paper Authors
Paper Abstract
This paper puts forward the concept that learning to take safe actions in unknown environments, even with probability-one guarantees, can be achieved without an unbounded number of exploratory trials, provided one is willing to mildly relax one's optimality requirements. We focus on the canonical multi-armed bandit problem and seek to study the exploration-preservation trade-off intrinsic to safe learning. More precisely, by defining a handicap metric that counts the number of unsafe actions taken, we provide an algorithm that discards unsafe machines (or actions) with probability one and achieves constant handicap. Our algorithm is rooted in the classical sequential probability ratio test, redefined here for continuing tasks. Under standard assumptions on sufficient exploration, our rule provably detects all unsafe machines in an (expected) finite number of rounds. The analysis also unveils a trade-off between the number of rounds needed to secure the environment and the probability of discarding safe machines. Our decision rule can wrap around any other algorithm that optimizes a specific auxiliary goal, since it provides a safe environment in which to search for (approximately) optimal policies. Simulations corroborate our theoretical findings and further illustrate the aforementioned trade-off.
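
A minimal sketch of the kind of rule the abstract describes, not the authors' implementation: each arm runs a one-sided sequential probability ratio test and is permanently discarded once the evidence that it is unsafe crosses a threshold tied to the tolerated error probability delta. The Gaussian cost model, the hypothesized means mu_safe and mu_unsafe, the class and function names, and the uniform exploration over surviving arms (one simple way to meet the sufficient-exploration assumption) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def llr_increment(x, mu0, mu1, sigma):
    # Log-likelihood ratio log p1(x)/p0(x) for one Gaussian cost sample x,
    # where p0 = N(mu0, sigma^2) is the "safe" hypothesis and
    # p1 = N(mu1, sigma^2) the "unsafe" one (assumed model, known variance).
    return ((x - mu0) ** 2 - (x - mu1) ** 2) / (2.0 * sigma ** 2)

class SafetySPRT:
    """One-sided SPRT per arm: an arm is discarded forever once its
    accumulated evidence for 'unsafe' crosses log(1/delta). There is no
    accept boundary, so testing keeps running for every surviving arm,
    one plausible reading of an SPRT 'redefined for continuing tasks'."""

    def __init__(self, n_arms, mu_safe=0.0, mu_unsafe=1.0, sigma=1.0, delta=1e-2):
        self.mu_safe, self.mu_unsafe, self.sigma = mu_safe, mu_unsafe, sigma
        self.threshold = np.log(1.0 / delta)   # evidence bar for declaring "unsafe"
        self.llr = np.zeros(n_arms)            # per-arm log-likelihood ratio
        self.active = list(range(n_arms))      # arms not yet discarded

    def update(self, arm, cost):
        self.llr[arm] += llr_increment(cost, self.mu_safe, self.mu_unsafe, self.sigma)
        if self.llr[arm] >= self.threshold:
            self.active.remove(arm)            # permanent discard of an unsafe arm

# Tiny simulation: arms 0-2 are safe (mean cost 0), arms 3-4 unsafe (mean cost 1).
true_means = np.array([0.0, 0.0, 0.0, 1.0, 1.0])
test = SafetySPRT(n_arms=5)
for t in range(5000):
    if not test.active:
        break
    arm = int(rng.choice(test.active))         # uniform exploration over survivors
    cost = true_means[arm] + rng.normal(0.0, 1.0)
    test.update(arm, cost)
print("arms still considered safe:", test.active)
```

The threshold log(1/delta) makes the trade-off stated in the abstract concrete: a smaller delta lowers the probability of discarding a safe machine but raises the evidence bar, so unsafe machines survive more rounds before detection.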