Paper Title
Revealing the Dark Secrets of Masked Image Modeling
Paper Authors
Paper Abstract
Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision downstream tasks, but how and where MIM works remain unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, visualizations and experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings a locality inductive bias to all layers of the trained models, whereas supervised models tend to focus locally at lower layers but more globally at higher layers. This may be why MIM helps Vision Transformers, which have a very large receptive field, to optimize. With MIM, the model maintains a large diversity among attention heads in all layers; for supervised models, the diversity among attention heads almost disappears in the last three layers, and less diversity harms the fine-tuning performance. From the experiments, we find that MIM models perform significantly better than their supervised counterparts on geometric and motion tasks with weak semantics, as well as on fine-grained classification tasks. Without bells and whistles, a standard MIM pre-trained SwinV2-L achieves state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). On semantic understanding datasets whose categories are sufficiently covered by the supervised pre-training, MIM models can still achieve highly competitive transfer performance. With a deeper understanding of MIM, we hope that our work can inspire new and solid research in this direction.
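The visualization findings above rest on two measurable quantities: how locally each attention head attends, and how different the heads within a layer are from one another. The sketch below illustrates one simple way to probe both from the post-softmax attention weights of a transformer block; the function names, the cosine-distance diversity measure, and the attention-weighted spatial distance are illustrative assumptions for this note, not necessarily the exact metrics used in the paper.

import torch

def attention_head_diversity(attn):
    # attn: (num_heads, N, N) post-softmax attention weights of one layer.
    # Returns the mean pairwise cosine distance between heads; larger values
    # mean the heads attend to more dissimilar patterns.
    h = attn.shape[0]
    flat = attn.reshape(h, -1)
    flat = flat / flat.norm(dim=-1, keepdim=True)
    sim = flat @ flat.t()                                # (h, h) cosine similarities
    off_diag = sim[~torch.eye(h, dtype=torch.bool)]      # drop self-similarity
    return (1.0 - off_diag).mean().item()

def average_attention_distance(attn, grid_size):
    # attn: (num_heads, N, N) over N = grid_size**2 patch tokens (no [CLS]).
    # Returns, per head, the mean spatial distance (in patch units) between a
    # query patch and the patches it attends to, weighted by the attention.
    ys, xs = torch.meshgrid(torch.arange(grid_size),
                            torch.arange(grid_size), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    dist = torch.cdist(coords, coords)                   # (N, N) patch distances
    return (attn * dist).sum(dim=-1).mean(dim=-1)        # (num_heads,)

# Example with random attention for a 14x14 patch grid and 12 heads.
attn = torch.rand(12, 196, 196).softmax(dim=-1)
print(attention_head_diversity(attn))        # diversity across heads in this layer
print(average_attention_distance(attn, 14))  # locality of each head

A larger attention-weighted distance indicates more global attention, and a larger pairwise cosine distance indicates more diverse heads; the latter is the quantity the abstract reports as collapsing in the last layers of supervised models while remaining high for MIM models.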