Paper Title
Cross-Modal Learning with 3D Deformable Attention for Action Recognition
Paper Authors
Paper Abstract
An important challenge in vision-based action recognition is embedding the spatiotemporal features of two or more heterogeneous modalities into a single feature. In this study, we propose a new 3D deformable transformer for action recognition with adaptive spatiotemporal receptive fields and a cross-modal learning scheme. The 3D deformable transformer consists of three attention modules: 3D deformable attention, local joint stride attention, and temporal stride attention. Two cross-modal tokens are input into the 3D deformable attention module to create a cross-attention token that reflects the spatiotemporal correlation between the modalities. Local joint stride attention is applied to spatially combine the attention and pose tokens. Temporal stride attention reduces the number of input tokens in the attention module along the temporal axis and supports temporal representation learning without using all tokens simultaneously. The deformable transformer iterates L times and combines the final cross-modal tokens for classification. The proposed 3D deformable transformer was tested on the NTU60, NTU120, FineGYM, and PennAction datasets, and achieved results better than or comparable to pre-trained state-of-the-art methods, even without a pre-training process. In addition, by visualizing the important joints and correlations identified during action recognition through spatial joint and temporal stride attention, we demonstrate the potential for explainable action recognition.
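The pipeline described in the abstract can be sketched at a high level: two modal token streams are mixed by cross-attention, and stride attention then thins the token set before the next of the L iterations. The sketch below is a minimal, hypothetical simplification — it uses plain scaled dot-product attention as a stand-in for the paper's 3D deformable and stride attention modules (whose sampling details are not given in the abstract), and all names, dimensions, and the stride value are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention (stand-in for the paper's
    # deformable/stride variants; their sampling offsets are omitted).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def stride_attention(tokens, stride=2):
    # Stride attention: queries are a strided subset of the tokens,
    # so the output has fewer tokens and not all tokens attend at once.
    return attention(tokens[::stride], tokens, tokens)

def cross_modal_block(visual, pose):
    # Cross-attention token mixing the two modalities
    # (hypothetical simplification of 3D deformable attention).
    cross = attention(visual, pose, pose)
    # Local joint stride attention over the pose/joint stream.
    pose = stride_attention(pose)
    # Temporal stride attention thinning the cross-modal stream.
    cross = stride_attention(cross)
    return cross, pose

rng = np.random.default_rng(0)
visual = rng.standard_normal((8, 16))  # 8 visual tokens, dim 16 (toy sizes)
pose = rng.standard_normal((8, 16))    # 8 pose tokens, dim 16
L = 2                                  # number of transformer iterations
for _ in range(L):
    visual, pose = cross_modal_block(visual, pose)
print(visual.shape)  # token count halved at each stride step
```

Each iteration halves the temporal token count via the stride, which is the abstract's point about supporting temporal learning without attending over all tokens simultaneously.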