Paper Title

Ablation Study of How Run Time Assurance Impacts the Training and Performance of Reinforcement Learning Agents

Paper Authors

Hamilton, Nathaniel; Dunlap, Kyle; Johnson, Taylor T.; Hobbs, Kerianne L.

Paper Abstract

Reinforcement Learning (RL) has become an increasingly important research area as the success of machine learning algorithms and methods grows. To combat the safety concerns surrounding the freedom given to RL agents while training, there has been an increase in work concerning Safe Reinforcement Learning (SRL). However, these new and safe methods have been held to less scrutiny than their unsafe counterparts. For instance, comparisons among safe methods often lack fair evaluation across similar initial condition bounds and hyperparameter settings, use poor evaluation metrics, and cherry-pick the best training runs rather than averaging over multiple random seeds. In this work, we conduct an ablation study using evaluation best practices to investigate the impact of run time assurance (RTA), which monitors the system state and intervenes to assure safety, on effective learning. By studying multiple RTA approaches in both on-policy and off-policy RL algorithms, we seek to understand which RTA methods are most effective, whether the agents become dependent on the RTA, and the importance of reward shaping versus safe exploration in RL agent training. Our conclusions shed light on the most promising directions of SRL, and our evaluation methodology lays the groundwork for creating better comparisons in future SRL work.
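
For intuition, the abstract's description of RTA as a component that "monitors the system state and intervenes to assure safety" matches a common action-filtering pattern, where the RL agent's proposed action is passed through unchanged when safe and replaced by a backup controller otherwise. The sketch below illustrates that pattern in Python; the `is_safe` check, `backup_action` controller, and one-step toy dynamics are illustrative assumptions, not the paper's implementation or any specific RTA method from the study.

```python
import numpy as np

def is_safe(state, action):
    """Hypothetical safety check: accept the action only if the
    predicted next state stays inside a known safe set.
    (Toy one-step dynamics and bounds -- not from the paper.)"""
    predicted_next = state + action
    return np.all(np.abs(predicted_next) <= 1.0)

def backup_action(state):
    """Hypothetical backup controller: steer back toward the
    interior of the safe set."""
    return -0.1 * state

def rta_filter(state, desired_action):
    """Run time assurance as an action filter: monitor the system
    state and intervene only when the agent's desired action would
    violate safety."""
    if is_safe(state, desired_action):
        return desired_action
    return backup_action(state)

# Usage: wrap the policy output before it reaches the environment,
# both during training (safe exploration) and at deployment.
state = np.array([0.9, -0.2])
agent_action = np.array([0.5, 0.0])   # would leave the safe set
safe_action = rta_filter(state, agent_action)
```

One design question this pattern raises, and which the ablation study probes directly, is whether an agent trained behind such a filter learns to avoid unsafe actions itself or becomes dependent on the filter's interventions.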
