Paper Title
Action Image Representation: Learning Scalable Deep Grasping Policies with Zero Real World Data
Paper Authors
Paper Abstract
This paper introduces Action Image, a new grasp proposal representation that allows learning an end-to-end deep grasping policy. Our model achieves $84\%$ grasp success on $172$ real-world objects while being trained only in simulation on $48$ objects with just naive domain randomization. Similar to computer vision problems, such as object detection, Action Image builds on the idea that object features are invariant to translation in image space. Therefore, grasp quality is invariant when evaluating the object-gripper relationship; a successful grasp for an object depends on its local context, but is independent of the surrounding environment. Action Image represents a grasp proposal as an image and uses a deep convolutional network to infer grasp quality. We show that by using an Action Image representation, trained networks are able to extract local, salient features of grasping tasks that generalize across different objects and environments. We show that this representation works on a variety of inputs, including color images (RGB), depth images (D), and combined color-depth (RGB-D). Our experimental results demonstrate that networks utilizing an Action Image representation exhibit strong domain transfer between training on simulated data and inference on real-world sensor streams. Finally, our experiments show that a network trained with Action Image improves grasp success ($84\%$ vs. $53\%$) over a baseline model with the same structure, but using actions encoded as vectors.
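To make the core idea concrete, below is a minimal Python sketch (not the authors' implementation) of how a grasp proposal can be rendered as an extra image channel and scored by a small convolutional network. It assumes, purely for illustration, a parallel-jaw grasp parameterized by two fingertip pixel locations; the names render_action_image and ActionImageGraspCritic are hypothetical helpers introduced here, and the network architecture is a toy stand-in for whatever backbone the paper actually uses.

```python
# Hedged sketch of the Action Image idea: rasterize a grasp proposal into an
# image channel, stack it with the RGB and/or depth observation, and let a
# small CNN predict grasp success. Hypothetical parameterization: two
# fingertip pixel coordinates per proposal.
import numpy as np
import torch
import torch.nn as nn


def render_action_image(image_hw, finger_px):
    """Rasterize a grasp proposal into a single-channel float image."""
    h, w = image_hw
    canvas = np.zeros((h, w), dtype=np.float32)
    for r, c in finger_px:
        rr, cc = int(round(r)), int(round(c))
        if 0 <= rr < h and 0 <= cc < w:
            # Small blob marking each fingertip location.
            canvas[max(rr - 2, 0):rr + 3, max(cc - 2, 0):cc + 3] = 1.0
    return canvas


class ActionImageGraspCritic(nn.Module):
    """Tiny CNN mapping (observation + action image) to a grasp-quality logit."""

    def __init__(self, in_channels):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 1)

    def forward(self, x):
        feats = self.backbone(x).flatten(1)
        return self.head(feats)  # logit; apply sigmoid for success probability


if __name__ == "__main__":
    h, w = 128, 128
    rgb = np.random.rand(h, w, 3).astype(np.float32)    # placeholder camera image
    depth = np.random.rand(h, w, 1).astype(np.float32)  # placeholder depth image
    action = render_action_image((h, w), [(60, 50), (60, 78)])  # one grasp proposal

    # RGB-D variant: stack observation channels with the rendered action image.
    stacked = np.concatenate([rgb, depth, action[..., None]], axis=-1)
    x = torch.from_numpy(stacked).permute(2, 0, 1).unsqueeze(0)  # (1, C, H, W)

    critic = ActionImageGraspCritic(in_channels=x.shape[1])
    print(torch.sigmoid(critic(x)))  # predicted grasp success probability
```

Because the proposal is expressed in the same pixel space as the observation, the convolutional critic only needs local features around the rendered gripper marks, which is what gives the representation its translation invariance and, per the abstract, its sim-to-real transfer; dropping the depth channel or the RGB channels yields the D-only and RGB-only variants discussed above.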