Paper Title

Masked Image Modeling with Denoising Contrast

Authors

Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, Xiaohu Qie

Abstract

Throughout the development of self-supervised visual representation learning from contrastive learning to masked image modeling (MIM), there has been no significant difference in essence, i.e., how to design proper pretext tasks for vision dictionary look-up. MIM has recently dominated this line of research with state-of-the-art performance on vision Transformers (ViTs), where the core is to enhance the network's patch-level visual context capturing via a denoising auto-encoding mechanism. Rather than tailoring image tokenizers with extra training stages as in previous works, we unleash the great potential of contrastive learning on denoising auto-encoding and introduce a pure MIM method, ConMIM, which produces simple intra-image inter-patch contrastive constraints as the sole learning objectives for masked patch prediction. We further strengthen the denoising mechanism with asymmetric designs, including image perturbations and model progress rates, to improve network pre-training. ConMIM-pretrained models of various scales achieve competitive results on downstream image classification, semantic segmentation, object detection, and instance segmentation tasks; e.g., on ImageNet-1K classification, we achieve 83.9% top-1 accuracy with ViT-Small and 85.3% with ViT-Base, without extra data for pre-training.
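The intra-image inter-patch contrastive objective described in the abstract can be sketched as an InfoNCE-style loss in which each masked patch's predicted feature must match the feature of the same patch position from a full (unmasked) view, with the other patches of the same image serving as negatives (the "dictionary look-up"). The function name, tensor shapes, and the use of a separate full-view feature set below are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def intra_image_patch_contrast(student_feats, teacher_feats, mask, tau=0.07):
    """Hypothetical sketch of an intra-image inter-patch contrastive loss
    for masked patch prediction.

    student_feats: (B, N, D) patch features predicted from the masked view
    teacher_feats: (B, N, D) patch features from the full (unmasked) view
    mask:          (B, N) boolean, True where the input patch was masked
    """
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    # Similarity of every predicted patch to every full-view patch
    # within the SAME image (intra-image dictionary look-up).
    logits = torch.einsum("bnd,bmd->bnm", s, t) / tau  # (B, N, N)
    # Positive key: the full-view patch at the same spatial position;
    # all other patches of the image act as negatives.
    target = torch.arange(logits.size(1), device=logits.device)
    target = target.unsqueeze(0).expand(logits.size(0), -1)  # (B, N)
    loss = F.cross_entropy(logits.flatten(0, 1), target.flatten(),
                           reduction="none")  # (B*N,)
    # Only masked positions contribute to the objective.
    keep = mask.flatten().float()
    return (loss * keep).sum() / keep.sum().clamp(min=1)
```

In practice the full-view features would come from a more slowly updated copy of the network, matching the asymmetric "model progress rates" mentioned above, while the masked view receives the stronger image perturbations.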
