Paper Title
ULN: Towards Underspecified Vision-and-Language Navigation
Paper Authors
Paper Abstract
Vision-and-Language Navigation (VLN) is the task of guiding an embodied agent to a target position using language instructions. Despite significant performance improvements, the wide use of fine-grained instructions fails to characterize the more practical linguistic variations found in reality. To fill this gap, we introduce a new setting, namely Underspecified Vision-and-Language Navigation (ULN), and associated evaluation datasets. ULN evaluates agents using multi-level underspecified instructions instead of purely fine-grained or coarse-grained ones, which is a more realistic and general setting. As a primary step toward ULN, we propose a VLN framework that consists of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module. Specifically, we propose learning Granularity Specific Sub-networks (GSS) that let the agent ground multi-level instructions with minimal additional parameters. Our E2E module then estimates grounding uncertainty and conducts multi-step lookahead exploration to further improve the success rate. Experimental results show that existing VLN models remain brittle to multi-level language underspecification. Our framework is more robust and outperforms the baselines on ULN by ~10% relative success rate across all levels.
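To make the framework components concrete, the sketch below shows one plausible way they could fit together: a granularity classifier routes each instruction through a level-specific adapter sub-network over a shared backbone (the GSS idea of grounding multi-level instructions with minimal extra parameters), and an entropy-based gate decides when to switch from exploitation to lookahead exploration (the E2E idea of acting on grounding uncertainty). This is a minimal hypothetical rendering in PyTorch; all class names, dimensions, the pooled-feature interface, and the entropy threshold are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GranularitySpecificSubNetworks(nn.Module):
    """Hypothetical GSS sketch: a shared backbone plus one lightweight
    residual adapter per instruction granularity level, so multi-level
    instructions are grounded with few additional parameters."""

    def __init__(self, hidden_dim=512, num_levels=2, adapter_dim=64, num_actions=6):
        super().__init__()
        # Stand-in for a pre-trained VLN instruction/vision encoder.
        self.backbone = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # One small bottleneck adapter per granularity level.
        self.adapters = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, adapter_dim),
                nn.ReLU(),
                nn.Linear(adapter_dim, hidden_dim),
            )
            for _ in range(num_levels)
        )
        # Classification module: predicts the instruction's granularity level.
        self.level_classifier = nn.Linear(hidden_dim, num_levels)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, instruction_feats):
        # instruction_feats: (batch, hidden_dim) pooled instruction features.
        h = self.backbone(instruction_feats)
        levels = self.level_classifier(h).argmax(dim=-1)
        # Route each sample through the adapter for its predicted level.
        adapted = torch.stack(
            [self.adapters[int(lvl)](h[i]) for i, lvl in enumerate(levels)]
        )
        return self.action_head(h + adapted)  # residual adapter connection


def should_explore(action_logits, entropy_threshold=1.0):
    """E2E-style gate (assumed form): switch from exploitation to
    lookahead exploration when grounding uncertainty, measured here as
    policy entropy, exceeds a threshold."""
    probs = F.softmax(action_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    return entropy > entropy_threshold


if __name__ == "__main__":
    model = GranularitySpecificSubNetworks()
    feats = torch.randn(4, 512)    # a batch of 4 pooled instruction features
    logits = model(feats)          # (4, num_actions) action logits
    print(should_explore(logits))  # boolean mask: which samples should explore
```

The residual adapters keep the per-level parameter count small relative to the shared backbone, which is one natural reading of grounding multi-level instructions "with minimal additional parameters"; the real GSS design may differ.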