Paper Title
Robot Learning of Mobile Manipulation with Reachability Behavior Priors
Paper Authors
Paper Abstract
Mobile Manipulation (MM) systems are ideal candidates for taking up the role of a personal assistant in unstructured real-world environments. Among other challenges, MM requires effective coordination of the robot's embodiments for executing tasks that require both mobility and manipulation. Reinforcement Learning (RL) holds the promise of endowing robots with adaptive behaviors, but most methods require prohibitively large amounts of data for learning a useful control policy. In this work, we study the integration of robotic reachability priors in actor-critic RL methods for accelerating the learning of MM for reaching and fetching tasks. Namely, we consider the problem of optimal base placement and the subsequent decision of whether to activate the arm for reaching a 6D target. For this, we devise a novel Hybrid RL method that handles discrete and continuous actions jointly, resorting to the Gumbel-Softmax reparameterization. Next, we train a reachability prior using data from the operational robot workspace, inspired by classical methods. Subsequently, we derive Boosted Hybrid RL (BHyRL), a novel algorithm for learning Q-functions by modeling them as a sum of residual approximators. Every time a new task needs to be learned, we can transfer our learned residuals and learn the component of the Q-function that is task-specific, hence, maintaining the task structure from prior behaviors. Moreover, we find that regularizing the target policy with a prior policy yields more expressive behaviors. We evaluate our method in simulation in reaching and fetching tasks of increasing difficulty, and we show the superior performance of BHyRL against baseline methods. Finally, we zero-transfer our learned 6D fetching policy with BHyRL to our MM robot TIAGo++. For more details and code release, please refer to our project site: irosalab.com/rlmmbp
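The hybrid action handling described above can be illustrated with a short sketch. The following is a minimal PyTorch example, not the authors' released code: a policy head that draws a continuous base-placement action via the Gaussian reparameterization trick and a discrete arm-activation decision via Gumbel-Softmax, so that both action types remain differentiable for actor-critic updates. All layer sizes, action dimensions, and names (e.g. HybridPolicy) are illustrative assumptions.

```python
# Hedged sketch of a hybrid discrete/continuous policy head (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridPolicy(nn.Module):
    def __init__(self, obs_dim, cont_dim=3, n_discrete=2, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, cont_dim)        # base placement, e.g. (x, y, yaw)
        self.log_std = nn.Linear(hidden, cont_dim)
        self.logits = nn.Linear(hidden, n_discrete)  # e.g. {keep moving, reach with arm}

    def forward(self, obs, tau=1.0):
        h = self.trunk(obs)
        std = self.log_std(h).clamp(-5, 2).exp()
        cont = self.mu(h) + std * torch.randn_like(std)   # reparameterized Gaussian sample
        # Gumbel-Softmax yields a differentiable, near-one-hot discrete sample;
        # hard=True applies the straight-through estimator in the backward pass.
        disc = F.gumbel_softmax(self.logits(h), tau=tau, hard=True)
        return cont, disc

policy = HybridPolicy(obs_dim=32)
base_action, arm_gate = policy(torch.randn(1, 32))
```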
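A second sketch, again an assumption rather than code from the release, illustrates the boosted Q-function idea: the Q-value for a new task is modeled as the frozen sum of previously learned residual approximators plus one trainable task-specific residual, so the structure from prior behaviors is preserved while only the new component is learned. Class names and the architecture are hypothetical.

```python
# Hedged sketch of a Q-function as a sum of residual approximators (assumed design).
import torch
import torch.nn as nn

class ResidualQ(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

class BoostedQ(nn.Module):
    """Q(s, a) = sum of frozen transferred residuals + trainable task residual."""
    def __init__(self, priors, new_residual):
        super().__init__()
        self.priors = nn.ModuleList(priors)
        for p in self.priors.parameters():
            p.requires_grad_(False)      # transferred residuals stay fixed
        self.residual = new_residual     # only this term is trained on the new task

    def forward(self, obs, act):
        q = sum(p(obs, act) for p in self.priors)
        return q + self.residual(obs, act)

# Usage: transfer two previously learned residuals, train only the new one.
priors = [ResidualQ(32, 4), ResidualQ(32, 4)]   # e.g. loaded from prior tasks
q_fn = BoostedQ(priors, new_residual=ResidualQ(32, 4))
```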