Paper Title
Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders
Paper Authors
Paper Abstract
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers. A representative MIM model, the masked auto-encoder (MAE), randomly masks a subset of image patches and reconstructs the masked patches given the unmasked patches. Concurrently, many recent works in self-supervised learning utilize the student/teacher paradigm, which provides the student with an additional target based on the output of a teacher composed of an exponential moving average (EMA) of previous students. Although common, relatively little is known about the dynamics of the interaction between the student and teacher. Through an analysis of a simple linear model, we find that the teacher conditionally removes previous gradient directions based on feature similarities, effectively acting as a conditional momentum regularizer. Based on this analysis, we present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which adds an EMA teacher to MAE. We find that RC-MAE converges faster and requires less memory than state-of-the-art self-distillation methods during pre-training, which may enhance the practicality of the prohibitively expensive self-supervised learning of Vision Transformer models. Additionally, we show that RC-MAE achieves greater robustness and better performance than MAE on downstream tasks such as ImageNet-1K classification, object detection, and instance segmentation.
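To make the abstract's idea concrete, below is a minimal PyTorch sketch of an RC-MAE-style training step: a student MAE reconstructs masked patches, an EMA teacher (a momentum copy of the student) produces a second reconstruction of the same patches as a consistency target, and the teacher is updated as an exponential moving average of the student. The module interface (`student(images)` returning predictions, pixel targets, and a mask), the consistency weight `lambda_c`, and the momentum value are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of an RC-MAE-style training step, assuming the student/teacher
# interfaces below; the teacher is typically initialized as a frozen deep
# copy of the student before training (e.g., copy.deepcopy(student)).
import torch
import torch.nn.functional as F


def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.999):
    """Update teacher parameters as an exponential moving average of the student."""
    with torch.no_grad():
        for t_param, s_param in zip(teacher.parameters(), student.parameters()):
            t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)


def rc_mae_step(student, teacher, optimizer, images, lambda_c: float = 1.0):
    """One training step: masked reconstruction loss + teacher-consistency loss.

    Assumes `student(images)` returns (pred_patches, target_patches, mask), where
    `mask` has shape [batch, num_patches] and marks masked positions, and that the
    teacher can be run with the same mask.
    """
    pred, target, mask = student(images)
    with torch.no_grad():
        teacher_pred, _, _ = teacher(images, mask=mask)  # same mask as the student

    # Pixel reconstruction loss on masked patches (as in MAE).
    rec_loss = (F.mse_loss(pred, target, reduction="none").mean(-1) * mask).sum() / mask.sum()
    # Consistency loss: match the EMA teacher's reconstruction of the same patches.
    con_loss = (F.mse_loss(pred, teacher_pred, reduction="none").mean(-1) * mask).sum() / mask.sum()

    loss = rec_loss + lambda_c * con_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(teacher, student)
    return loss.item()
```

Note that only the student receives gradients; the teacher is updated purely through the EMA step, which is what allows it to act as the conditional momentum regularizer described above.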