Paper Title

FastMIM: Expediting Masked Image Modeling Pre-training for Vision

Authors

Jianyuan Guo, Kai Han, Han Wu, Yehui Tang, Yunhe Wang, Chang Xu

Abstract

The combination of transformers and the masked image modeling (MIM) pre-training framework has shown great potential in various vision tasks. However, the pre-training computational budget is too heavy and prevents MIM from becoming a practical training paradigm. This paper presents FastMIM, a simple and generic framework for expediting masked image modeling with the following two steps: (i) pre-training vision backbones with low-resolution input images; and (ii) reconstructing Histograms of Oriented Gradients (HOG) features instead of the original RGB values of the input images. In addition, we propose FastMIM-P, which progressively enlarges the input resolution during the pre-training stage to further enhance the transfer results of high-capacity models. We point out that: (i) a wide range of input resolutions in the pre-training phase can lead to similar performance in the fine-tuning phase and in downstream tasks such as detection and segmentation; (ii) the shallow layers of the encoder are more important during pre-training, and discarding the last several layers can speed up the training stage without harming fine-tuning performance; (iii) the decoder should match the size of the selected network; and (iv) HOG features are more stable than RGB values under resolution transfer. Equipped with FastMIM, all kinds of vision backbones can be pre-trained efficiently. For example, we achieve 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones. Compared to previous relevant approaches, we achieve comparable or better top-1 accuracy while accelerating the training procedure by $\sim$5$\times$. Code can be found at https://github.com/ggjy/FastMIM.pytorch.
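To make the two steps above concrete, below is a minimal sketch (not the authors' released implementation) of an MAE-style training step that masks patches of a low-resolution image and regresses per-patch HOG descriptors, computed here with scikit-image, only at the masked positions. The `encoder`/`decoder` interfaces, the `hog_targets` helper, the 75% mask ratio, and the patch/HOG hyperparameters are illustrative assumptions.

```python
# Minimal sketch of FastMIM-style pre-training: low-resolution input + HOG
# reconstruction targets. `encoder` and `decoder` are hypothetical modules,
# not the released FastMIM code.
import numpy as np
import torch
import torch.nn.functional as F
from skimage.feature import hog


def hog_targets(image, patch=16):
    """Compute one HOG descriptor per non-overlapping patch of an HxWx3 image."""
    h, w, _ = image.shape
    targets = []
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            gray = image[y:y + patch, x:x + patch].mean(axis=-1)  # grayscale patch
            desc = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                       cells_per_block=(1, 1), feature_vector=True)
            targets.append(desc)
    return torch.tensor(np.stack(targets), dtype=torch.float32)  # (num_patches, hog_dim)


def fastmim_step(encoder, decoder, image_np, mask_ratio=0.75):
    """One hypothetical pre-training step on a single low-res (e.g. 128x128) image."""
    targets = hog_targets(image_np)                    # (num_patches, hog_dim)
    num_patches = targets.shape[0]
    num_masked = int(mask_ratio * num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[torch.randperm(num_patches)[:num_masked]] = True

    x = torch.from_numpy(image_np).permute(2, 0, 1).unsqueeze(0).float()
    latent = encoder(x, mask)                          # encode visible patches only
    pred = decoder(latent)                             # (1, num_patches, hog_dim)
    # L2 loss restricted to the masked patches, as in MAE-style objectives.
    return F.mse_loss(pred[0][mask], targets[mask])
```

Because the target is a histogram of local gradient orientations rather than raw pixels, it changes little when the crop resolution changes, which is the property that claim (iv) in the abstract relies on.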
