Paper Title

VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation

Paper Authors

Kaizhi Zheng, Xiaotong Chen, Odest Chadwicke Jenkins, Xin Eric Wang

Paper Abstract

Benefiting from language flexibility and compositionality, humans naturally intend to use language to command an embodied agent for complex tasks such as navigation and object manipulation. In this work, we aim to fill the blank of the last mile of embodied agents -- object manipulation by following human guidance, e.g., "move the red mug next to the box while keeping it upright." To this end, we introduce an Automatic Manipulation Solver (AMSolver) system and build a Vision-and-Language Manipulation benchmark (VLMbench) based on it, containing various language instructions on categorized robotic manipulation tasks. Specifically, modular rule-based task templates are created to automatically generate robot demonstrations with language instructions, consisting of diverse object shapes and appearances, action types, and motion constraints. We also develop a keypoint-based model 6D-CLIPort to deal with multi-view observations and language input and output a sequence of 6 degrees of freedom (DoF) actions. We hope the new simulator and benchmark will facilitate future research on language-guided robotic manipulation.
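
To make the interface described above concrete, here is a minimal, hypothetical sketch of what a 6D-CLIPort-style policy consumes and produces: multi-view observations plus a language instruction in, a sequence of 6-DoF end-effector waypoints out. The names `SixDofAction` and `predict_actions` are illustrative assumptions, not the actual VLMbench/AMSolver API, and the function body is a stub rather than a real model.

```python
# Hypothetical sketch of the input/output contract of a keypoint-based
# vision-and-language manipulation policy (e.g., 6D-CLIPort). Not the
# actual VLMbench API; names and signatures are assumptions.

from dataclasses import dataclass
from typing import List, Sequence

import numpy as np


@dataclass
class SixDofAction:
    """One waypoint in the predicted action sequence."""
    position: np.ndarray    # (3,) target end-effector position (x, y, z), metres
    quaternion: np.ndarray  # (4,) target end-effector orientation (x, y, z, w)
    gripper_open: bool      # desired gripper state at this waypoint


def predict_actions(
    rgb_views: Sequence[np.ndarray],    # multi-view RGB images, each (H, W, 3)
    depth_views: Sequence[np.ndarray],  # aligned depth maps, each (H, W)
    instruction: str,                   # e.g. "move the red mug next to the
                                        #       box while keeping it upright"
) -> List[SixDofAction]:
    """Stub standing in for the learned policy.

    A real model would fuse visual and language features and regress
    per-step keypoint poses; here we return a single no-op waypoint so
    the interface is runnable.
    """
    return [
        SixDofAction(
            position=np.zeros(3),
            quaternion=np.array([0.0, 0.0, 0.0, 1.0]),  # identity rotation
            gripper_open=True,
        )
    ]


if __name__ == "__main__":
    views = [np.zeros((128, 128, 3), dtype=np.uint8)] * 2
    depths = [np.zeros((128, 128), dtype=np.float32)] * 2
    plan = predict_actions(views, depths, "move the red mug next to the box")
    print(f"{len(plan)} waypoint(s); first position: {plan[0].position}")
```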
