Paper Title
Learning to Retrieve Videos by Asking Questions
Paper Authors
Paper Abstract
The majority of traditional text-to-video retrieval systems operate in static environments, i.e., there is no interaction between the user and the agent beyond the initial textual query provided by the user. This can be sub-optimal if the initial query is ambiguous, which can lead to many incorrectly retrieved videos. To overcome this limitation, we propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog, refining the retrieved results by answering questions generated by the agent. Our novel multimodal question generator learns to ask questions that maximize subsequent video retrieval performance, using (i) the video candidates retrieved during the last round of interaction with the user and (ii) the text-based dialog history documenting all previous interactions, to generate questions that incorporate both visual and linguistic cues relevant to video retrieval. Furthermore, to produce maximally informative questions, we propose Information-Guided Supervision (IGS), which guides the question generator to ask questions that boost subsequent video retrieval accuracy. We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems. We also demonstrate that our proposed approach generalizes to real-world settings involving interactions with real humans, thus demonstrating the robustness and generality of our framework.
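
The abstract describes a multi-round loop in which questions are conditioned on the current video candidates and the dialog history, and the retriever is re-queried after each answer. The following is a minimal sketch of such a loop, not the authors' implementation: `retrieve`, `ask`, and `answer` are hypothetical stand-ins for the text-to-video retriever, the multimodal question generator, and the user (or an answerer model), and the concatenation-based query augmentation is an assumption for illustration.

```python
# Hypothetical sketch of a dialog-based video retrieval loop in the spirit of ViReD.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class DialogState:
    query: str                                        # initial text query from the user
    history: List[str] = field(default_factory=list)  # alternating questions and answers

def interactive_retrieval(
    query: str,
    retrieve: Callable[[str], List[str]],             # text query -> ranked list of video ids
    ask: Callable[[List[str], List[str]], str],       # (candidates, dialog history) -> question
    answer: Callable[[str], str],                     # question -> user's answer
    num_rounds: int = 3,
    top_k: int = 10,
) -> List[str]:
    """Run several rounds of dialog, re-ranking video candidates after each answer."""
    state = DialogState(query=query)
    candidates = retrieve(query)[:top_k]
    for _ in range(num_rounds):
        # The question is conditioned on (i) the current video candidates and
        # (ii) the text-based dialog history, as described in the abstract.
        question = ask(candidates, state.history)
        reply = answer(question)
        state.history.extend([question, reply])
        # Re-query the retriever with the original query augmented by the dialog so far
        # (simple string concatenation here; the actual fusion is an assumption).
        augmented_query = " ".join([state.query] + state.history)
        candidates = retrieve(augmented_query)[:top_k]
    return candidates
```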
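
Information-Guided Supervision (IGS) is described only at a high level in the abstract (it supervises the generator toward questions that boost subsequent retrieval accuracy). The sketch below illustrates one plausible reading under assumed details: among several sampled candidate questions, keep the one whose answer yields the best retrieval rank of the ground-truth video. `sample_questions`, `oracle_answer`, and `rank_of` are hypothetical helpers, not the paper's API.

```python
# Hypothetical sketch of selecting a supervision target in an IGS-like fashion.
from typing import Callable, List

def igs_select(
    query: str,
    history: List[str],
    gt_video: str,
    sample_questions: Callable[[str, List[str]], List[str]],  # propose candidate questions
    oracle_answer: Callable[[str, str], str],                 # answer a question about gt_video
    rank_of: Callable[[str, str], int],                       # rank of gt_video for a text query
) -> str:
    """Return the candidate question whose answer gives the best rank of the target video."""
    candidates = sample_questions(query, history)

    def rank_after(question: str) -> int:
        ans = oracle_answer(question, gt_video)
        new_query = " ".join([query] + history + [question, ans])
        return rank_of(new_query, gt_video)

    # Lower rank means the ground-truth video is retrieved earlier.
    return min(candidates, key=rank_after)
```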