Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning

Paper Title

Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning

Authors

Zhang, Weixia; Ma, Chao; Wu, Qi; Yang, Xiaokang

Abstract

The emerging vision-and-language navigation (VLN) problem aims at learning to navigate an agent to a target location in unseen photo-realistic environments according to a given language instruction. The challenges of VLN arise mainly from two aspects. First, the agent needs to attend to the meaningful paragraphs of the language instruction that correspond to the dynamically varying visual environment. Second, during training, the agent usually imitates the shortest path to the target location. Because actions are selected differently in training and in inference, an agent trained solely by imitation learning does not perform well. Sampling the next action from the agent's predicted probability distribution during training allows it to explore diverse routes through the environments, yielding higher success rates. Nevertheless, without being presented with the shortest navigation paths during training, the agent may arrive at the target location through an unexpectedly longer route. To overcome these challenges, we design a cross-modal grounding module, composed of two complementary attention mechanisms, to equip the agent with a better ability to track the correspondence between the textual and visual modalities. We then propose to recursively alternate the learning schemes of imitation and exploration to narrow the discrepancy between training and inference. We further exploit the advantages of both learning schemes via adversarial learning. Extensive experimental results on the Room-to-Room (R2R) benchmark dataset demonstrate that the proposed learning scheme generalizes and is complementary to prior art. Our method performs well against state-of-the-art approaches in terms of effectiveness and efficiency.
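The difference between the two training schemes named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed switching `period`, and the toy logits are all assumptions made for illustration, and the cross-modal grounding module and the adversarial component are not modeled here. Under imitation the agent follows the shortest-path (teacher) action; under exploration it samples the next action from its own predicted distribution.

```python
import math
import random


def softmax(logits):
    """Convert raw action scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits, teacher_action, mode, rng):
    """Choose the action the agent follows at one training step.

    mode == "imitation":   follow the shortest-path (teacher) action.
    mode == "exploration": sample from the agent's predicted distribution.
    """
    if mode == "imitation":
        return teacher_action
    if mode == "exploration":
        probs = softmax(logits)
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    raise ValueError(f"unknown mode: {mode}")


def alternating_modes(num_iterations, period):
    """Yield one training mode per iteration, switching every `period` steps.

    The fixed period is a hypothetical schedule chosen for this sketch;
    the abstract does not state how the alternation is scheduled.
    """
    for i in range(num_iterations):
        yield "imitation" if (i // period) % 2 == 0 else "exploration"


if __name__ == "__main__":
    rng = random.Random(0)
    toy_logits = [0.1, 2.0, -1.0]  # made-up scores for three candidate actions
    for step, mode in enumerate(alternating_modes(6, period=2)):
        action = select_action(toy_logits, teacher_action=2, mode=mode, rng=rng)
        print(step, mode, action)
```

In a real agent the logits would come from the policy network and the teacher action from the simulator's shortest-path planner; both are replaced by fixed toy values here.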
