Paper Title

Visual Subtitle Feature Enhanced Video Outline Generation

Paper Authors

Qi Lv, Ziqiang Cao, Wenrui Xie, Derui Wang, Jingwen Wang, Zhiwei Hu, Tangkun Zhang, Ba Yuan, Yuanhang Li, Min Cao, Wenjie Li, Sujian Li, Guohong Fu

Paper Abstract

With the tremendously increasing number of videos, there is a great demand for techniques that help people quickly navigate to the video segments they are interested in. However, current work on video understanding mainly focuses on video content summarization, while little effort has been made to explore the structure of a video. Inspired by textual outline generation, we introduce a novel video understanding task, namely video outline generation (VOG). This task is defined to contain two sub-tasks: (1) first segmenting the video according to its content structure and then (2) generating a heading for each segment. To learn and evaluate VOG, we annotate a 10k+ dataset, called DuVOG. Specifically, we use OCR tools to recognize the subtitles of videos. Annotators are then asked to divide the subtitles into chapters and title each chapter. In videos, highlighted text tends to be the headline, since it is more likely to attract attention. Therefore, we propose a Visual Subtitle feature Enhanced video outline generation model (VSENet), which takes as input the textual subtitles together with their visual font sizes and positions. We treat the VOG task as a sequence tagging problem: the model extracts the spans where headings are located and then rewrites them to form the final outline. Furthermore, based on the similarity between video outlines and textual outlines, we use a large number of articles with chapter headings to pretrain our model. Experiments on DuVOG show that our model largely outperforms other baseline methods, achieving an F1-score of 77.1 at the video segmentation level and 85.0 ROUGE-L_F0.5 at the headline generation level.
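The abstract only sketches VSENet at a high level. As a rough illustration of the core idea, combining subtitle tokens with discretized visual features (font size, on-screen position) for heading-span tagging, here is a minimal PyTorch sketch. It is not the authors' implementation; the class name, bin counts, tag set, and the small two-layer encoder are all assumptions made for illustration.

```python
# Hypothetical sketch (not the released VSENet code): subtitle tokens plus
# binned visual font-size and position features, encoded jointly and tagged
# with BIO labels (O, B-HEAD, I-HEAD) to mark where headings are located.
import torch
import torch.nn as nn

class VisualSubtitleTagger(nn.Module):
    def __init__(self, vocab_size=21128, hidden=768, num_font_bins=8,
                 num_pos_bins=16, num_tags=3):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        # Discretized visual features of each subtitle token.
        self.font_emb = nn.Embedding(num_font_bins, hidden)
        self.pos_emb = nn.Embedding(num_pos_bins, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=12,
                                       batch_first=True),
            num_layers=2)
        self.classifier = nn.Linear(hidden, num_tags)

    def forward(self, token_ids, font_bins, pos_bins):
        # Sum textual and visual embeddings per token, then tag each token.
        x = (self.token_emb(token_ids) + self.font_emb(font_bins)
             + self.pos_emb(pos_bins))
        return self.classifier(self.encoder(x))  # (batch, seq, num_tags)

# Toy usage: one subtitle sequence of 6 tokens.
model = VisualSubtitleTagger()
tokens = torch.randint(0, 21128, (1, 6))
fonts = torch.randint(0, 8, (1, 6))      # larger bin = bigger on-screen font
positions = torch.randint(0, 16, (1, 6))
print(model(tokens, fonts, positions).shape)  # torch.Size([1, 6, 3])
```

Summing visual embeddings into the token representation mirrors how the abstract motivates the visual cue: tokens rendered in a large, prominent font receive a signal that pushes the tagger toward heading labels; the predicted spans would then be rewritten into the final outline headings.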
