Paper Title

Contrastive Masked Autoencoders for Self-Supervised Video Hashing

Authors

Yuting Wang, Jinpeng Wang, Bin Chen, Ziyun Zeng, Shutao Xia

Abstract

Self-Supervised Video Hashing (SSVH) models learn to generate short binary representations for videos without ground-truth supervision, facilitating large-scale video retrieval efficiency and attracting increasing research attention. The success of SSVH lies in the understanding of video content and the ability to capture the semantic relation among unlabeled videos. Typically, state-of-the-art SSVH methods consider these two points in a two-stage training pipeline, where they firstly train an auxiliary network by instance-wise mask-and-predict tasks and secondly train a hashing model to preserve the pseudo-neighborhood structure transferred from the auxiliary network. This consecutive training strategy is inflexible and also unnecessary. In this paper, we propose a simple yet effective one-stage SSVH method called ConMH, which incorporates video semantic information and video similarity relationship understanding in a single stage. To capture video semantic information for better hashing learning, we adopt an encoder-decoder structure to reconstruct the video from its temporal-masked frames. Particularly, we find that a higher masking ratio helps video understanding. Besides, we fully exploit the similarity relationship between videos by maximizing agreement between two augmented views of a video, which contributes to more discriminative and robust hash codes. Extensive experiments on three large-scale video datasets (i.e., FCVID, ActivityNet and YFCC) indicate that ConMH achieves state-of-the-art results. Code is available at https://github.com/huangmozhi9527/ConMH.
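To make the one-stage objective concrete, below is a minimal PyTorch-style sketch of how temporally masked reconstruction and a SimCLR-style contrastive term over two augmented views of the same video could be combined in a single training step. The module names, dimensions, masking scheme, and loss weighting are illustrative assumptions for exposition, not the authors' released implementation (see the GitHub link above for that).

```python
# Minimal sketch (assumed names/shapes): temporally masked reconstruction +
# SimCLR-style contrastive agreement between two augmented views of a video.
# NOT the authors' implementation; dimensions and the 0.1 weight are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConMHSketch(nn.Module):
    def __init__(self, feat_dim=2048, hid_dim=256, n_bits=64, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(feat_dim, hid_dim)
        make_layer = lambda: nn.TransformerEncoderLayer(hid_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(make_layer(), num_layers=2)
        self.decoder = nn.TransformerEncoder(make_layer(), num_layers=1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, hid_dim))
        self.recon_head = nn.Linear(hid_dim, feat_dim)  # predicts masked frame features
        self.hash_head = nn.Linear(hid_dim, n_bits)     # produces hash code logits

    def forward(self, frames):
        # frames: (B, T, feat_dim) pre-extracted frame features
        B, T, _ = frames.shape
        x = self.embed(frames)
        # temporal masking: replace a high ratio of frame tokens with a learnable mask token
        masked = torch.rand(B, T, device=frames.device) < self.mask_ratio
        x = torch.where(masked.unsqueeze(-1), self.mask_token.expand(B, T, -1), x)
        h = self.encoder(x)
        recon = self.recon_head(self.decoder(h))
        recon_loss = F.mse_loss(recon[masked], frames[masked])  # loss on masked frames only
        # continuous relaxation of the binary code; sign(.) would be applied at retrieval time
        code = torch.tanh(self.hash_head(h.mean(dim=1)))
        return code, recon_loss

def nt_xent(z1, z2, tau=0.5):
    # standard NT-Xent loss: each view's positive is the other view of the same video
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)
    sim = z @ z.t() / tau
    n = z1.size(0)
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool, device=z.device), float('-inf'))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# One training step on two temporally augmented views of the same batch of videos.
model = ConMHSketch()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
view1 = torch.randn(8, 16, 2048)  # stand-ins for augmented frame features
view2 = torch.randn(8, 16, 2048)
code1, rec1 = model(view1)
code2, rec2 = model(view2)
loss = (rec1 + rec2) + 0.1 * nt_xent(code1, code2)  # weighting is an illustrative choice
opt.zero_grad()
loss.backward()
opt.step()
```

In line with the abstract, a high mask_ratio (e.g., 0.75) forces the encoder to model temporal structure from few visible frames, while the contrastive term encourages the hash codes of two views of the same video to agree, yielding more discriminative and robust codes.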
