Paper Title
Offline Robot Reinforcement Learning with Uncertainty-Guided Human Expert Sampling
Paper Authors
Paper Abstract
Recent advances in batch (offline) reinforcement learning have shown promising results in learning from available offline data and have established offline reinforcement learning as an essential toolkit for learning control policies in a model-free setting. An offline reinforcement learning algorithm applied to a dataset collected by a suboptimal, non-learning-based algorithm can result in a policy that outperforms the behavior agent used to collect the data. Such a scenario is frequent in robotics, where existing automation is collecting operational data. Although offline learning techniques can learn from data generated by a suboptimal behavior agent, there is still an opportunity to improve the sample complexity of existing offline reinforcement learning algorithms by strategically introducing human demonstration data into the training process. To this end, we propose a novel approach that uses uncertainty estimation to trigger the injection of human demonstration data and guide policy training towards optimal behavior while reducing overall sample complexity. Our experiments show that this approach is more sample efficient than a naive way of combining expert data with data collected from a suboptimal agent. We augmented an existing offline reinforcement learning algorithm, Conservative Q-Learning, with our approach and performed experiments on data collected from the MuJoCo and OffWorld Gym learning environments.
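The abstract does not spell out how the uncertainty signal gates the expert data. The following is a minimal sketch, not the authors' implementation: it assumes the uncertainty estimate comes from the disagreement (standard deviation) of an ensemble of Q-networks, and that hypothetical replay-buffer objects (suboptimal_buffer, expert_buffer) expose a sample() method returning PyTorch tensors. Transitions whose uncertainty exceeds a threshold are swapped for human demonstration transitions before the batch is handed to an offline RL learner such as Conservative Q-Learning.

# Minimal sketch (assumptions noted above); names such as QNetwork,
# sample_mixed_batch, and the buffer objects are hypothetical.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Simple state-action value network used as one member of the ensemble."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def ensemble_uncertainty(q_ensemble, obs, act):
    """Per-transition uncertainty: std. dev. of Q-estimates across the ensemble."""
    with torch.no_grad():
        qs = torch.stack([q(obs, act) for q in q_ensemble], dim=0)  # (E, B, 1)
    return qs.std(dim=0).squeeze(-1)                                # (B,)

def sample_mixed_batch(suboptimal_buffer, expert_buffer, q_ensemble,
                       batch_size=256, uncertainty_threshold=1.0):
    """Draw a batch from the suboptimal dataset; replace high-uncertainty
    transitions with human expert transitions before the offline RL update."""
    obs, act, rew, next_obs, done = suboptimal_buffer.sample(batch_size)
    unc = ensemble_uncertainty(q_ensemble, obs, act)
    swap = unc > uncertainty_threshold          # boolean mask over the batch
    n_swap = int(swap.sum().item())
    if n_swap > 0:
        e_obs, e_act, e_rew, e_next, e_done = expert_buffer.sample(n_swap)
        obs[swap], act[swap], rew[swap] = e_obs, e_act, e_rew
        next_obs[swap], done[swap] = e_next, e_done
    return obs, act, rew, next_obs, done

In this sketch the threshold and the choice of ensemble disagreement as the uncertainty measure are illustrative assumptions; any calibrated uncertainty estimate over the dataset transitions could play the same gating role.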