Paper Title
Improving Visual Representation Learning through Perceptual Understanding
Paper Authors
Paper Abstract
We present an extension to masked autoencoders (MAE) which improves the representations learnt by the model by explicitly encouraging the learning of higher scene-level features. We do this by: (i) introducing a perceptual similarity term between generated and real images; (ii) incorporating several techniques from the adversarial training literature, including multi-scale training and adaptive discriminator augmentation. The combination of these results not only in better pixel reconstruction but also in representations which appear to better capture higher-level details within images. More consequentially, we show how our method, Perceptual MAE, leads to better performance when used for downstream tasks, outperforming previous methods. We achieve 78.1% top-1 accuracy with linear probing on ImageNet-1K and up to 88.1% when fine-tuning, with similar results for other downstream tasks, all without the use of additional pre-trained models or data.
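To make the loss formulation in (i) concrete, below is a minimal PyTorch sketch of an MAE-style pixel reconstruction loss augmented with a perceptual similarity term, computed as a distance between intermediate feature maps of the reconstructed and real images. The names `perceptual_mae_loss`, `feat_net`, and `lambda_perc` are illustrative assumptions, not the paper's actual implementation; since the abstract states that no additional pre-trained models are used, `feat_net` would in practice be a network trained jointly with the autoencoder (e.g., a discriminator) rather than a frozen pre-trained extractor.

```python
import torch
import torch.nn.functional as F


def perceptual_mae_loss(
    pred_pixels: torch.Tensor,
    target_pixels: torch.Tensor,
    feat_net,                 # hypothetical: returns a list of feature maps
    lambda_perc: float = 0.1, # hypothetical weighting, not from the paper
) -> torch.Tensor:
    """Sketch: MAE pixel reconstruction loss plus a perceptual similarity term."""
    # Standard MAE objective: pixel-space MSE between reconstruction and target.
    pixel_loss = F.mse_loss(pred_pixels, target_pixels)

    # Perceptual similarity term: MSE between intermediate feature maps of the
    # generated and real images, summed over the layers `feat_net` exposes.
    perc_loss = pred_pixels.new_zeros(())
    for f_pred, f_real in zip(feat_net(pred_pixels), feat_net(target_pixels)):
        perc_loss = perc_loss + F.mse_loss(f_pred, f_real)

    return pixel_loss + lambda_perc * perc_loss
```

The weighting term trades off pixel-level fidelity against feature-level similarity; a larger weight pushes the model toward the higher scene-level features the abstract describes, at some cost to exact pixel reconstruction.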