Paper Title
i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?
Paper Authors
Abstract
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain. However, the mechanism and properties of the representations learned by such a scheme, as well as how to further enhance these representations, are so far not well explored. In this paper, we propose an interactive Masked Autoencoders (i-MAE) framework to enhance the representation capability from two aspects: (1) employing a two-way image reconstruction and a latent feature reconstruction with a distillation loss to learn better features; (2) proposing a semantics-enhanced sampling strategy to boost the learned semantics in MAE. With the proposed i-MAE architecture, we can address two critical questions about the behavior of the representations learned in MAE: (1) Is the linear separability of latent representations in Masked Autoencoders helpful for model performance? We study this by forcing the input to be a mixture of two images instead of one. (2) Can we enhance the representations in the latent feature space by controlling the degree of semantics during sampling in Masked Autoencoders? To this end, we propose a sampling strategy within a mini-batch based on the semantics of training samples to examine this aspect. Extensive experiments are conducted on CIFAR-10/100, Tiny-ImageNet, and ImageNet-1K to verify our observations. Furthermore, in addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space by proposing two evaluation schemes. The surprising and consistent results demonstrate that i-MAE is a superior framework design for understanding MAE frameworks, as well as achieving better representational ability. Code is available at https://github.com/vision-learning-acceleration-lab/i-mae.
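The abstract describes feeding the encoder a mixture of two images rather than one, combined with MAE-style random patch masking. The sketch below illustrates these two input-side steps only; it is a minimal illustration, not the paper's implementation. The linear blend coefficient `alpha`, the 75% mask ratio, and the helper names `mix_images`/`random_mask` are assumptions for this example and may differ from i-MAE's actual settings.

```python
import numpy as np

def mix_images(x1, x2, alpha=0.35):
    """Linearly blend two images into one input, as in i-MAE's
    two-image mixture setting (alpha here is illustrative)."""
    return alpha * x1 + (1 - alpha) * x2

def random_mask(num_patches, mask_ratio=0.75, seed=0):
    """Return sorted indices of the patches left visible after
    MAE-style random masking (default 75% of patches hidden)."""
    rng = np.random.default_rng(seed)
    num_visible = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_visible])

# Toy example: two 4x4 "images" and a 16-patch grid of 1x1 patches.
x1 = np.ones((4, 4))      # stand-in for the first image
x2 = np.zeros((4, 4))     # stand-in for the second image
mixed = mix_images(x1, x2, alpha=0.35)   # every pixel becomes 0.35
visible = random_mask(16, mask_ratio=0.75)  # 4 of 16 patches kept
```

In the full framework, the encoder sees only the visible patches of `mixed`, and the two-way decoder is trained to reconstruct both source images, which is what probes whether the two images' features remain linearly separable in the latent space.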