Paper Title
Learning Multi-step Robotic Manipulation Policies from Visual Observation of Scene and Q-value Predictions of Previous Action
Paper Authors
Paper Abstract
In this work, we focus on multi-step manipulation tasks that involve long-horizon planning and consider progress reversal. Such tasks interlace high-level reasoning, which determines the sequence of expected states that achieve the overall task, with low-level reasoning, which decides what actions will yield these states. We propose a sample-efficient Previous Action Conditioned Robotic Manipulation Network (PAC-RoManNet) to learn action-value functions and predict manipulation action candidates from visual observation of the scene and the action-value predictions of the previous action. We define a Task Progress based Gaussian (TPG) reward function that computes the reward based on actions that lead to successful motion primitives and on progress towards the overall task goal. To balance exploration and exploitation, we introduce a Loss Adjusted Exploration (LAE) policy that selects actions from the candidates according to a Boltzmann distribution over loss estimates. We demonstrate the effectiveness of our approach by training PAC-RoManNet to learn several challenging multi-step robotic manipulation tasks in both simulation and the real world. Experimental results show that our method outperforms existing methods and achieves state-of-the-art performance in terms of success rate and action efficiency. Ablation studies show that TPG and LAE are especially beneficial for tasks such as stacking multiple blocks. Additional experiments on the Ravens-10 benchmark tasks suggest good generalizability of the proposed PAC-RoManNet.
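The abstract describes the LAE policy only at a high level: actions are drawn from candidates according to a Boltzmann distribution over loss estimates. A minimal sketch of that selection step is below. It is a hypothetical reading, not the paper's implementation: the function name, the per-candidate `loss_estimates` input, and the assumption that higher estimated loss should receive higher sampling probability (to steer exploration toward poorly-modeled actions) are all assumptions; the paper may define the distribution differently.

```python
import numpy as np

def lae_select_action(loss_estimates, temperature=1.0, rng=None):
    """Hypothetical sketch of Boltzmann action selection over loss estimates.

    Samples an index from a softmax over `loss_estimates / temperature`.
    Assumption: higher estimated loss -> higher sampling probability,
    so less-understood action candidates are explored more often.
    """
    if rng is None:
        rng = np.random.default_rng()
    losses = np.asarray(loss_estimates, dtype=float)
    # Softmax with max-subtraction for numerical stability.
    logits = losses / temperature
    logits -= logits.max()
    probs = np.exp(logits)
    probs /= probs.sum()
    return rng.choice(len(losses), p=probs)
```

A low `temperature` makes the choice nearly greedy with respect to the loss estimates, while a high one approaches uniform random exploration, which is the usual exploration/exploitation dial in Boltzmann-style policies.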