Paper Title

R3M: A Universal Visual Representation for Robot Manipulation

Paper Authors

Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, Abhinav Gupta

Paper Abstract

We study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Concretely, we pre-train a visual representation using the Ego4D human video dataset using a combination of time-contrastive learning, video-language alignment, and an L1 penalty to encourage sparse and compact representations. The resulting representation, R3M, can be used as a frozen perception module for downstream policy learning. Across a suite of 12 simulated robot manipulation tasks, we find that R3M improves task success by over 20% compared to training from scratch and by over 10% compared to state-of-the-art visual representations like CLIP and MoCo. Furthermore, R3M enables a Franka Emika Panda arm to learn a range of manipulation tasks in a real, cluttered apartment given just 20 demonstrations. Code and pre-trained models are available at https://tinyurl.com/robotr3m.
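Since the abstract outlines a concrete pre-training recipe (time-contrastive learning, video-language alignment, and an L1 penalty on the representation, with the encoder later frozen for policy learning), below is a minimal PyTorch sketch of how such a combined objective could be wired together. Everything in it, including the `R3MSketch` class, the toy encoder, the feature and language dimensions, and the InfoNCE-style loss formulations, is an illustrative assumption rather than the authors' released implementation; the official code and pre-trained models are at the link above.

```python
# Illustrative sketch of the pre-training objective described in the abstract:
# time-contrastive learning + video-language alignment + an L1 sparsity penalty.
# Module names, dimensions, and loss formulations are assumptions, not R3M's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class R3MSketch(nn.Module):
    def __init__(self, feat_dim=512, lang_dim=768, l1_weight=1e-5):
        super().__init__()
        # Visual encoder phi(o) -> R^feat_dim (a ResNet in the real model;
        # a flatten + linear layer here to keep the sketch self-contained).
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        # Scores whether a language instruction describes the progress made
        # between two frames of a video (video-language alignment).
        self.align_head = nn.Sequential(
            nn.Linear(2 * feat_dim + lang_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )
        self.l1_weight = l1_weight

    def time_contrastive_loss(self, z_anchor, z_pos, z_neg):
        # Frames close in time (anchor, positive) should embed closer together
        # than temporally distant frames (negative).
        pos = -torch.norm(z_anchor - z_pos, dim=-1)
        neg = -torch.norm(z_anchor - z_neg, dim=-1)
        logits = torch.stack([pos, neg], dim=-1)
        labels = torch.zeros(logits.shape[0], dtype=torch.long)
        return F.cross_entropy(logits, labels)

    def alignment_loss(self, z0, zt, lang, lang_mismatch):
        # The matched instruction should score higher than a mismatched one.
        s_pos = self.align_head(torch.cat([z0, zt, lang], dim=-1))
        s_neg = self.align_head(torch.cat([z0, zt, lang_mismatch], dim=-1))
        logits = torch.cat([s_pos, s_neg], dim=-1)
        labels = torch.zeros(logits.shape[0], dtype=torch.long)
        return F.cross_entropy(logits, labels)

    def forward(self, frames, lang, lang_mismatch):
        # frames: (batch, 3, C, H, W) -> initial, positive, and negative frame.
        z0 = self.encoder(frames[:, 0])
        z_pos = self.encoder(frames[:, 1])
        z_neg = self.encoder(frames[:, 2])
        tcn = self.time_contrastive_loss(z0, z_pos, z_neg)
        align = self.alignment_loss(z0, z_pos, lang, lang_mismatch)
        sparsity = z0.abs().mean()  # L1 penalty for sparse, compact features
        return tcn + align + self.l1_weight * sparsity


# Example usage with random data (batch of 4, tiny 32x32 frames).
model = R3MSketch()
frames = torch.rand(4, 3, 3, 32, 32)
lang = torch.randn(4, 768)
loss = model(frames, lang, lang[torch.randperm(4)])
loss.backward()

# For downstream policy learning, the abstract describes using the encoder as a
# frozen perception module, e.g.:
for p in model.encoder.parameters():
    p.requires_grad_(False)
```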
