Paper Title
Representation Learning with Video Deep InfoMax
Paper Authors
Paper Abstract
Self-supervised learning has made unsupervised pretraining relevant again for difficult computer vision tasks. The most effective self-supervised methods involve prediction tasks based on features extracted from diverse views of the data. DeepInfoMax (DIM) is a self-supervised method which leverages the internal structure of deep networks to construct such views, forming prediction tasks between local features, which depend on small patches in an image, and global features, which depend on the whole image. In this paper, we extend DIM to the video domain by leveraging similar structure in spatio-temporal networks, producing a method we call Video Deep InfoMax (VDIM). We find that drawing views from both natural-rate sequences and temporally-downsampled sequences yields results on Kinetics-pretrained action recognition tasks which match or outperform prior state-of-the-art methods that use more costly large-time-scale transformer models. We also examine the effects of data augmentation and fine-tuning methods, accomplishing SoTA by a large margin when training only on the UCF-101 dataset.
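The local-global prediction task over two temporal views described in the abstract can be sketched concretely. The following is a minimal, hypothetical PyTorch illustration, not the authors' released implementation: `sample_views`, the feature shapes, and the single-direction loss are all simplifying assumptions. It pairs a natural-rate clip with a temporally-downsampled clip from the same video and scores one view's global feature against the other view's local space-time features with an InfoNCE-style objective.

```python
# Minimal sketch (assumed names and shapes) of a DIM-style local-global
# contrastive task on video, pairing a natural-rate clip with a
# temporally-downsampled clip as the abstract describes.
import torch
import torch.nn.functional as F

def sample_views(video, clip_len=16, stride=4):
    """Draw two views from one video: (B, C, T, H, W) -> two (B, C, clip_len, H, W).

    The natural-rate view is a contiguous window; the downsampled view keeps
    every `stride`-th frame, covering a longer time span at the same length.
    Requires T >= clip_len * stride.
    """
    T = video.shape[2]
    t0 = torch.randint(0, T - clip_len + 1, (1,)).item()
    natural = video[:, :, t0:t0 + clip_len]
    t1 = torch.randint(0, T - clip_len * stride + 1, (1,)).item()
    downsampled = video[:, :, t1:t1 + clip_len * stride:stride]
    return natural, downsampled

def local_global_infonce(local_feats, global_feat, temperature=0.1):
    """InfoNCE between one view's global feature and the other view's locals.

    local_feats: (B, D, N) local features from a spatio-temporal encoder,
                 with N flattened space-time positions.
    global_feat: (B, D) global feature of the paired view.
    Matched (batch element, position) pairs are positives; locals from
    other batch elements act as negatives.
    """
    B, D, N = local_feats.shape
    locals_flat = local_feats.permute(0, 2, 1).reshape(B * N, D)
    logits = global_feat @ locals_flat.t() / temperature   # (B, B*N)
    # Row b scores element b's global feature against every local position;
    # each of its own N positions is a positive target, so duplicate the
    # row N times and point each copy at one of those positions.
    logits = logits.repeat_interleave(N, dim=0)            # (B*N, B*N)
    labels = torch.arange(B * N)
    return F.cross_entropy(logits, labels)
```

In use, one would encode both views with the same spatio-temporal network and, plausibly, apply the loss symmetrically (each view's global feature against the other view's locals) alongside data augmentation; this sketch only conveys the structure of the local-global prediction task, not the full training recipe.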