Paper Title

SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Paper Authors

Tao Jin, Siyu Huang, Ming Chen, Yingming Li, Zhongfei Zhang

Paper Abstract

In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation. However, video captioning is a multimodal learning problem, and the video features have much redundancy between different time steps. Based on these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT employs a boundary-aware pooling operation on the scores from multi-head attention and selects diverse features from different scenarios. Also, SBAT includes a local correlation scheme to compensate for the local information loss brought by the sparse operation. Based on SBAT, we further propose an aligned cross-modal encoding scheme to boost the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms the state-of-the-art methods under most of the metrics.
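The abstract describes boundary-aware pooling over multi-head attention scores only at a high level, so the sketch below illustrates one plausible reading of the idea: for each query, keep only the key positions whose attention scores change most sharply between adjacent time steps (a simple boundary heuristic) and mask the remaining positions before the softmax. This is a minimal illustration rather than the authors' actual SBAT implementation; the function name `boundary_aware_sparse_attention`, the `top_k` parameter, and the first-order-difference boundary signal are assumptions made here for demonstration.

```python
# Illustrative sketch only: one plausible reading of "boundary-aware" sparse
# attention, NOT the authors' exact algorithm. All names are hypothetical.
import torch
import torch.nn.functional as F


def boundary_aware_sparse_attention(q, k, v, top_k=8):
    """Scaled dot-product attention that keeps, for each query, only the
    key positions whose attention scores change most sharply between
    adjacent time steps (a simple boundary heuristic).

    q, k, v: tensors of shape (batch, heads, time, dim).
    """
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5  # (B, H, Tq, Tk)

    # Boundary signal: absolute first-order difference of the scores along
    # the key/time axis; large jumps are treated as scenario boundaries.
    diff = (scores[..., 1:] - scores[..., :-1]).abs()
    boundary = F.pad(diff, (1, 0))  # pad so shapes match; first key gets score 0

    # Keep the top-k "boundary" positions per query, mask out the rest.
    k_eff = min(top_k, scores.size(-1))
    idx = boundary.topk(k_eff, dim=-1).indices
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, idx, 0.0)

    attn = torch.softmax(scores + mask, dim=-1)
    return torch.matmul(attn, v)


if __name__ == "__main__":
    B, H, T, D = 2, 4, 32, 64
    q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
    out = boundary_aware_sparse_attention(q, k, v, top_k=8)
    print(out.shape)  # torch.Size([2, 4, 32, 64])
```

The sketch covers only the sparse selection step; the paper's local correlation scheme and aligned cross-modal encoding across feature streams, mentioned in the abstract, are not reproduced here.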
