Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning

Paper Title

Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning

Authors

Zehua Zhang, David Crandall

Abstract

We present a novel technique for self-supervised video representation learning by: (a) decoupling the learning objective into two contrastive subtasks respectively emphasizing spatial and temporal features, and (b) performing it hierarchically to encourage multi-scale understanding. Motivated by their effectiveness in supervised learning, we first introduce spatial-temporal feature learning decoupling and hierarchical learning to the context of unsupervised video learning. We show by experiments that augmentations can be manipulated as regularization to guide the network to learn desired semantics in contrastive learning, and we propose a way for the model to separately capture spatial and temporal features at multiple scales. We also introduce an approach to overcome the problem of divergent levels of instance invariance at different hierarchies by modeling the invariance as loss weights for objective re-weighting. Experiments on downstream action recognition benchmarks on UCF101 and HMDB51 show that our proposed Hierarchically Decoupled Spatial-Temporal Contrast (HDC) makes substantial improvements over directly learning spatial-temporal features as a whole and achieves competitive performance when compared with other state-of-the-art unsupervised methods. Code will be made available.
