Paper Title

Gate-Shift-Fuse for Video Action Recognition

Paper Authors

Swathikiran Sudhakaran, Sergio Escalera, Oswald Lanz

Paper Abstract

Convolutional Neural Networks are the de facto models for image recognition. However, 3D CNNs, the straightforward extension of 2D CNNs to video recognition, have not achieved the same success on standard action recognition benchmarks. One of the main reasons for this reduced performance of 3D CNNs is their increased computational complexity, which requires large-scale annotated datasets to train them at scale. 3D kernel factorization approaches have been proposed to reduce the complexity of 3D CNNs, but existing kernel factorization approaches follow hand-designed and hard-wired techniques. In this paper we propose Gate-Shift-Fuse (GSF), a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner. GSF leverages grouped spatial gating to decompose the input tensor and channel weighting to fuse the decomposed tensors. GSF can be inserted into existing 2D CNNs to convert them into efficient and high-performing spatio-temporal feature extractors, with negligible parameter and compute overhead. We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.
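The gate–shift–fuse idea from the abstract can be illustrated in a minimal NumPy sketch. This is not the paper's implementation: the learned grouped spatial gate and the learned channel-weighting fusion are replaced here by simple per-channel sigmoid weights (`gate_w`, `fuse_w` are hypothetical stand-ins), and the temporal shift is a fixed forward/backward shift of the channel halves with zero padding at the clip boundaries.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gate_shift_fuse(x, gate_w, fuse_w):
    """Sketch of a gate-shift-fuse step on a clip tensor.

    x:      (T, C, H, W) features for T frames.
    gate_w: (C,) stand-in for the paper's learned spatial gating.
    fuse_w: (C,) stand-in for the paper's learned channel weighting.
    """
    T, C, H, W = x.shape

    # 1) Gate: decompose the input into a part routed through time
    #    and a residual part that stays at its own frame.
    gate = sigmoid(gate_w)[None, :, None, None]       # broadcast over T, H, W
    routed, residual = x * gate, x * (1.0 - gate)

    # 2) Shift: move half the routed channels one step forward in time
    #    and the other half one step backward (zero-padded at the ends).
    shifted = np.zeros_like(routed)
    shifted[1:, : C // 2] = routed[:-1, : C // 2]     # forward shift
    shifted[:-1, C // 2:] = routed[1:, C // 2:]       # backward shift

    # 3) Fuse: combine shifted and residual features with per-channel
    #    weights (the paper learns these in a data-dependent way).
    w = sigmoid(fuse_w)[None, :, None, None]
    return w * shifted + (1.0 - w) * residual

# Usage: the module keeps the 2D feature shape, so it can be dropped
# between spatial layers of a 2D CNN without changing the architecture.
x = np.random.randn(4, 8, 5, 5)
y = gate_shift_fuse(x, np.zeros(8), np.zeros(8))
print(y.shape)  # (4, 8, 5, 5)
```

Because the output tensor has exactly the input's shape, the module matches the abstract's claim that GSF can be inserted into an existing 2D CNN with negligible overhead: only the small gate and fusion parameters are added.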
