Paper Title
ULN: Towards Underspecified Vision-and-Language Navigation
Paper Authors
Paper Abstract
Vision-and-Language Navigation (VLN) is the task of guiding an embodied agent to a target position using language instructions. Despite significant performance improvements, the wide use of fine-grained instructions fails to characterize the more practical linguistic variations found in reality. To fill this gap, we introduce a new setting, namely Underspecified Vision-and-Language Navigation (ULN), and associated evaluation datasets. ULN evaluates agents using multi-level underspecified instructions instead of purely fine-grained or coarse-grained ones, which is a more realistic and general setting. As a primary step toward ULN, we propose a VLN framework that consists of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module. Specifically, we propose learning Granularity Specific Sub-networks (GSS) that let the agent ground multi-level instructions with minimal additional parameters. Our E2E module then estimates grounding uncertainty and conducts multi-step lookahead exploration to further improve the success rate. Experimental results show that existing VLN models remain brittle to multi-level language underspecification. Our framework is more robust and outperforms the baselines on ULN by ~10% relative success rate across all levels.
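To make the framework components concrete, the sketch below shows one plausible way they could fit together: a granularity classifier routes each instruction through a level-specific adapter sub-network over a shared backbone (the GSS idea of grounding multi-level instructions with minimal extra parameters), and an entropy-based gate decides when to switch from exploitation to lookahead exploration (the E2E idea of acting on grounding uncertainty). This is a minimal hypothetical rendering in PyTorch; all class names, dimensions, the pooled-feature interface, and the entropy threshold are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GranularitySpecificSubNetworks(nn.Module):
    """Hypothetical GSS sketch: a shared backbone plus one lightweight
    residual adapter per instruction granularity level, so multi-level
    instructions are grounded with few additional parameters."""

    def __init__(self, hidden_dim=512, num_levels=2, adapter_dim=64, num_actions=6):
        super().__init__()
        # Stand-in for a pre-trained VLN instruction/vision encoder.
        self.backbone = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # One small bottleneck adapter per granularity level.
        self.adapters = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_dim, adapter_dim),
                nn.ReLU(),
                nn.Linear(adapter_dim, hidden_dim),
            )
            for _ in range(num_levels)
        )
        # Classification module: predicts the instruction's granularity level.
        self.level_classifier = nn.Linear(hidden_dim, num_levels)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, instruction_feats):
        # instruction_feats: (batch, hidden_dim) pooled instruction features.
        h = self.backbone(instruction_feats)
        levels = self.level_classifier(h).argmax(dim=-1)
        # Route each sample through the adapter for its predicted level.
        adapted = torch.stack(
            [self.adapters[int(lvl)](h[i]) for i, lvl in enumerate(levels)]
        )
        return self.action_head(h + adapted)  # residual adapter connection


def should_explore(action_logits, entropy_threshold=1.0):
    """E2E-style gate (assumed form): switch from exploitation to
    lookahead exploration when grounding uncertainty, measured here as
    policy entropy, exceeds a threshold."""
    probs = F.softmax(action_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)
    return entropy > entropy_threshold


if __name__ == "__main__":
    model = GranularitySpecificSubNetworks()
    feats = torch.randn(4, 512)    # a batch of 4 pooled instruction features
    logits = model(feats)          # (4, num_actions) action logits
    print(should_explore(logits))  # boolean mask: which samples should explore
```

The residual adapters keep the per-level parameter count small relative to the shared backbone, which is one natural reading of grounding multi-level instructions "with minimal additional parameters"; the real GSS design may differ.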