Paper title
What and Where: Modeling Skeletons from Semantic and Spatial Perspectives for Action Recognition
Paper authors
Paper abstract
Skeleton data, which consists only of the 2D/3D coordinates of human joints, has been widely studied for human action recognition. Existing methods take semantics as prior knowledge to group human joints and draw correlations according to their spatial locations, which we call the semantic perspective for skeleton modeling. In this paper, in contrast to previous approaches, we propose to model skeletons from a novel spatial perspective, in which the model takes spatial locations as prior knowledge to group human joints and mines the discriminative patterns of local areas in a hierarchical manner. The two perspectives are orthogonal and complementary to each other, and by fusing them in a unified framework, our method achieves a more comprehensive understanding of the skeleton data. In addition, we customize two networks for the two perspectives. From the semantic perspective, we propose a Transformer-like network that excels at modeling joint correlations, and present three effective techniques to adapt it to skeleton data. From the spatial perspective, we transform the skeleton data into a sparse format for efficient feature extraction and present two types of sparse convolutional networks for sparse skeleton modeling. Extensive experiments are conducted on three challenging datasets for skeleton-based human action/gesture recognition, namely NTU-60, NTU-120, and SHREC, where our method achieves state-of-the-art performance.
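To make the spatial perspective more concrete, the sketch below shows one possible way to put a single skeleton frame into a sparse format: joint coordinates are quantized onto a regular grid, and only the occupied cells are stored, so joints that are close in space end up in the same group regardless of their semantic label. The abstract does not specify the actual sparse representation, grid resolution, or normalization used in the paper; the function name `voxelize_skeleton`, the parameter `grid_size`, and the bounding-box normalization are all assumptions made for illustration only.

```python
from collections import defaultdict


def voxelize_skeleton(joints, grid_size=8):
    """Group joints by the grid cell they fall into (hypothetical helper).

    joints: list of coordinate tuples, all 2D or all 3D.
    Returns {cell index tuple: [joint indices]} for occupied cells only,
    which is the sparse part: empty cells are never stored.
    """
    if not joints:
        return {}
    dims = len(joints[0])
    # Per-axis bounding box of this frame (an assumed normalization).
    mins = [min(j[d] for j in joints) for d in range(dims)]
    maxs = [max(j[d] for j in joints) for d in range(dims)]
    voxels = defaultdict(list)
    for idx, joint in enumerate(joints):
        cell = []
        for d in range(dims):
            extent = maxs[d] - mins[d]
            # Normalize to [0, 1]; an axis with zero extent maps to 0.
            unit = (joint[d] - mins[d]) / extent if extent > 0 else 0.0
            # Clamp so the maximum coordinate lands in the last cell.
            cell.append(min(int(unit * grid_size), grid_size - 1))
        voxels[tuple(cell)].append(idx)
    return dict(voxels)


if __name__ == "__main__":
    frame = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (1.0, 1.0, 1.0)]
    print(voxelize_skeleton(frame, grid_size=4))
```

In this toy frame the first two joints share a cell while the third sits alone, so a sparse convolution over such a structure would only visit two cells of the 64-cell grid. A hierarchical variant could repeat the grouping with progressively smaller `grid_size` values, though whether the paper does so is not stated in the abstract.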