Paper Title

VILT: Video Instructions Linking for Complex Tasks

Paper Authors

Sophie Fischer, Carlos Gemmell, Iain Mackie, Jeffrey Dalton

Paper Abstract

This work addresses the challenges of developing conversational assistants that support rich multimodal video interactions to accomplish real-world tasks interactively. We introduce the task of automatically linking instructional videos to task steps as "Video Instructions Linking for Complex Tasks" (VILT). Specifically, we focus on the domain of cooking and empower users to cook meals interactively with a video-enabled Alexa skill. We create a reusable benchmark with 61 queries from recipe tasks and curate a collection of 2,133 instructional "How-To" cooking videos. Studying VILT with state-of-the-art retrieval methods, we find that dense retrieval with ANCE is the most effective, achieving an NDCG@3 of 0.566 and a P@1 of 0.644. We also conduct a user study that measures the effect of incorporating videos in a real-world task setting, where 10 participants perform several cooking tasks under varying multimodal experimental conditions using a state-of-the-art Alexa TaskBot system. Users interacting with manually linked videos said they learned something new 64% of the time, 9 percentage points more than with automatically linked videos (55%), indicating that linked video relevance is important for task learning.
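
To make the retrieval setup and the reported metrics concrete, below is a minimal sketch (not the authors' code) of a VILT-style pipeline: embed a recipe-step query and candidate video titles with a dense bi-encoder, rank by similarity, and score the ranking with NDCG@3 and P@1. The encoder checkpoint and the toy videos, query, and relevance judgments are illustrative assumptions, not data from the paper.

```python
# A self-contained sketch of dense retrieval plus NDCG@3 / P@1 scoring.
# The checkpoint below is a public ANCE-trained bi-encoder; the paper's
# exact model and data may differ.
import math

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/msmarco-roberta-base-ance-firstp")

# Toy corpus standing in for the 2,133 "How-To" cooking videos (titles only).
videos = [
    "How to dice an onion quickly",
    "How to knead bread dough by hand",
    "How to julienne carrots",
]
query = "dice an onion for a stir fry"  # a recipe-step query, as in the benchmark

# Rank videos by dot product of normalized embeddings (cosine similarity).
video_emb = encoder.encode(videos, normalize_embeddings=True)
query_emb = encoder.encode([query], normalize_embeddings=True)
ranking = np.argsort(-(video_emb @ query_emb[0]))  # best match first


def ndcg_at_k(ranked_gains, k=3):
    """NDCG@k for one query, given graded relevance gains in ranked order."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(ranked_gains[:k]))
    ideal = sorted(ranked_gains, reverse=True)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0


def p_at_1(ranked_gains):
    """P@1: is the top-ranked video relevant (gain > 0)?"""
    return 1.0 if ranked_gains and ranked_gains[0] > 0 else 0.0


# Hypothetical graded judgments, index-aligned with `videos`.
judgments = {0: 2, 1: 0, 2: 1}
ranked_gains = [judgments.get(int(i), 0) for i in ranking]
print(f"NDCG@3 = {ndcg_at_k(ranked_gains):.3f}, P@1 = {p_at_1(ranked_gains):.1f}")
```

In the paper's setting these per-query scores would be averaged over the 61 benchmark queries to yield the reported NDCG@3 of 0.566 and P@1 of 0.644.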
