Paper Title

SBAT: Video Captioning with Sparse Boundary-Aware Transformer

Paper Authors

Tao Jin, Siyu Huang, Ming Chen, Yingming Li, Zhongfei Zhang

Paper Abstract

In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer was proposed for uni-modal language generation tasks such as machine translation. However, video captioning is a multimodal learning problem, and the video features have much redundancy between different time steps. Based on these concerns, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in video representation. SBAT employs a boundary-aware pooling operation on the scores from multi-head attention and selects diverse features from different scenarios. Also, SBAT includes a local correlation scheme to compensate for the local information loss brought by the sparse operation. Based on SBAT, we further propose an aligned cross-modal encoding scheme to boost the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms the state-of-the-art methods under most of the metrics.
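The abstract describes boundary-aware pooling over multi-head attention scores only at a high level, so the sketch below illustrates one plausible reading of the idea: for each query, keep only the key positions whose attention scores change most sharply between adjacent time steps (a simple boundary heuristic) and mask the remaining positions before the softmax. This is a minimal illustration rather than the authors' actual SBAT implementation; the function name `boundary_aware_sparse_attention`, the `top_k` parameter, and the first-order-difference boundary signal are assumptions made here for demonstration.

```python
# Illustrative sketch only: one plausible reading of "boundary-aware" sparse
# attention, NOT the authors' exact algorithm. All names are hypothetical.
import torch
import torch.nn.functional as F


def boundary_aware_sparse_attention(q, k, v, top_k=8):
    """Scaled dot-product attention that keeps, for each query, only the
    key positions whose attention scores change most sharply between
    adjacent time steps (a simple boundary heuristic).

    q, k, v: tensors of shape (batch, heads, time, dim).
    """
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / d ** 0.5  # (B, H, Tq, Tk)

    # Boundary signal: absolute first-order difference of the scores along
    # the key/time axis; large jumps are treated as scenario boundaries.
    diff = (scores[..., 1:] - scores[..., :-1]).abs()
    boundary = F.pad(diff, (1, 0))  # pad so shapes match; first key gets score 0

    # Keep the top-k "boundary" positions per query, mask out the rest.
    k_eff = min(top_k, scores.size(-1))
    idx = boundary.topk(k_eff, dim=-1).indices
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, idx, 0.0)

    attn = torch.softmax(scores + mask, dim=-1)
    return torch.matmul(attn, v)


if __name__ == "__main__":
    B, H, T, D = 2, 4, 32, 64
    q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
    out = boundary_aware_sparse_attention(q, k, v, top_k=8)
    print(out.shape)  # torch.Size([2, 4, 32, 64])
```

The sketch covers only the sparse selection step; the paper's local correlation scheme and aligned cross-modal encoding across feature streams, mentioned in the abstract, are not reproduced here.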
