A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility

Paper Title

A Dataset for Interactive Vision-Language Navigation with Unknown Command Feasibility

Authors

Andrea Burns, Deniz Arsan, Sanjna Agrawal, Ranjitha Kumar, Kate Saenko, Bryan A. Plummer

Abstract

Vision-language navigation (VLN), in which an agent follows language instruction in a visual environment, has been studied under the premise that the input command is fully feasible in the environment. Yet in practice, a request may not be possible due to language ambiguity or environment changes. To study VLN with unknown command feasibility, we introduce a new dataset Mobile app Tasks with Iterative Feedback (MoTIF), where the goal is to complete a natural language command in a mobile app. Mobile apps provide a scalable domain to study real downstream uses of VLN methods. Moreover, mobile app commands provide instruction for interactive navigation, as they result in action sequences with state changes via clicking, typing, or swiping. MoTIF is the first to include feasibility annotations, containing both binary feasibility labels and fine-grained labels for why tasks are unsatisfiable. We further collect follow-up questions for ambiguous queries to enable research on task uncertainty resolution. Equipped with our dataset, we propose the new problem of feasibility prediction, in which a natural language instruction and multimodal app environment are used to predict command feasibility. MoTIF provides a more realistic app dataset as it contains many diverse environments, high-level goals, and longer action sequences than prior work. We evaluate interactive VLN methods using MoTIF, quantify the generalization ability of current approaches to new app environments, and measure the effect of task feasibility on navigation performance.
