Stand-Alone Inter-Frame Attention in Video Models

Paper Title

Stand-Alone Inter-Frame Attention in Video Models

Authors

Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Jiebo Luo, Tao Mei

Abstract

Motion, the property that distinguishes video from still images, has been critical to the development of video understanding models. Modern deep learning models leverage motion by executing spatio-temporal 3D convolutions, by factorizing 3D convolutions into separate spatial and temporal convolutions, or by computing self-attention along the temporal dimension. The implicit assumption behind these successes is that the feature maps across consecutive frames can be nicely aggregated. Nevertheless, the assumption may not always hold, especially for regions with large deformation. In this paper, we present a new recipe for an inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA), which delves into the deformation across frames to estimate local self-attention at each spatial location. Technically, SIFA remoulds the deformable design by re-scaling the offset predictions with the difference between two frames. Taking each spatial location in the current frame as the query, the locally deformable neighbors in the next frame are regarded as the keys/values. SIFA then measures the similarity between the query and the keys as stand-alone attention and uses it to compute a weighted average of the values for temporal aggregation. We further plug the SIFA block into ConvNets and Vision Transformer, respectively, to devise SIFA-Net and SIFA-Transformer. Extensive experiments conducted on four video datasets demonstrate the superiority of SIFA-Net and SIFA-Transformer as stronger backbones. More remarkably, SIFA-Transformer achieves an accuracy of 83.1% on the Kinetics-400 dataset. Source code is available at https://github.com/FuchenUSTC/SIFA.
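
The mechanism described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation (see the repository linked above for that). The function name `inter_frame_attention` and the `base_offsets` argument are hypothetical stand-ins for a learned offset predictor. The sketch also makes three simplifying assumptions that the abstract does not specify: offsets are re-scaled by the normalized magnitude of the frame difference, sampling rounds to the nearest pixel rather than interpolating, and queries, keys, and values use no learned projections.

```python
import numpy as np


def inter_frame_attention(cur, nxt, base_offsets, k=3):
    """Toy inter-frame attention between two feature maps.

    cur, nxt:     arrays of shape (C, H, W), current and next frame features.
    base_offsets: array of shape (k*k, 2, H, W) holding (dy, dx) per neighbor;
                  a hypothetical stand-in for a learned offset predictor.
    Returns an array of shape (C, H, W).
    """
    C, H, W = cur.shape
    r = k // 2
    # Regular k x k grid of neighbor displacements, shape (k*k, 2).
    grid = np.array([(dy, dx) for dy in range(-r, r + 1)
                     for dx in range(-r, r + 1)], dtype=np.float64)

    # Assumption: re-scale offsets by the normalized magnitude of the frame
    # difference, so static regions fall back to the regular grid.
    diff = np.linalg.norm(nxt - cur, axis=0)        # (H, W)
    scale = diff / (diff.max() + 1e-8)              # values in [0, 1]
    offsets = base_offsets * scale[None, None]      # (k*k, 2, H, W)

    # Deformed sampling positions, rounded to the nearest pixel and clipped
    # to the image border (a simplification of interpolated sampling).
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    py = np.rint(ys[None] + grid[:, 0, None, None] + offsets[:, 0])
    px = np.rint(xs[None] + grid[:, 1, None, None] + offsets[:, 1])
    py = np.clip(py, 0, H - 1).astype(int)          # (k*k, H, W)
    px = np.clip(px, 0, W - 1).astype(int)          # (k*k, H, W)

    # Keys and values: next-frame features at the deformed neighbors.
    keys = nxt[:, py, px]                           # (C, k*k, H, W)

    # Query: the current-frame feature at each location. Scaled dot-product
    # similarity with every key, followed by a softmax over the neighbors.
    logits = np.einsum("chw,ckhw->khw", cur, keys) / np.sqrt(C)
    logits -= logits.max(axis=0, keepdims=True)     # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=0, keepdims=True)         # (k*k, H, W)

    # Temporal aggregation: attention-weighted average of the values.
    return np.einsum("khw,ckhw->chw", attn, keys)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cur = rng.standard_normal((8, 14, 14))
    nxt = rng.standard_normal((8, 14, 14))
    offs = rng.standard_normal((9, 2, 14, 14))
    print(inter_frame_attention(cur, nxt, offs).shape)
```

Because the softmax weights are non-negative and sum to one over the neighbors, each output feature is a convex combination of next-frame features, which is the "weighted average" the abstract refers to.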
