Paper Title
MILAN: Masked Image Pretraining on Language Assisted Representation
Paper Authors
Paper Abstract
Self-attention based transformer models have been dominating many computer vision tasks in the past few years. Their superb model quality heavily depends on excessively large labeled image datasets. To reduce the reliance on large labeled datasets, reconstruction-based masked autoencoders are gaining popularity; they learn high-quality transferable representations from unlabeled images. For the same purpose, recent weakly supervised image pretraining methods explore language supervision from the text captions accompanying the images. In this work, we propose masked image pretraining on language assisted representation, dubbed MILAN. Instead of predicting raw pixels or low-level features, our pretraining objective is to reconstruct image features that carry substantial semantic signals, obtained using caption supervision. Moreover, to accommodate our reconstruction target, we propose a more effective prompting decoder architecture and a semantic-aware mask sampling mechanism, which further advance the transfer performance of the pretrained model. Experimental results demonstrate that MILAN delivers higher accuracy than previous works. When the masked autoencoder is pretrained and finetuned on the ImageNet-1K dataset with an input resolution of 224x224, MILAN achieves a top-1 accuracy of 85.4% on ViT-Base, surpassing the previous state of the art by 1%. On the downstream semantic segmentation task, MILAN achieves 52.7 mIoU using ViT-Base on the ADE20K dataset, outperforming previous masked pretraining results by 4 points.
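The two ideas stated in the abstract, reconstructing language-supervised features (rather than pixels) on masked patches and sampling the mask so that semantically important patches tend to stay visible, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the target features, attention scores, and decoder prediction are all simulated with random arrays standing in for a frozen caption-supervised encoder (e.g. a CLIP image encoder), a ViT attention map, and the model output.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, dim = 196, 512   # 14x14 patches of a 224x224 image; hypothetical feature dim
mask_ratio = 0.75

# Reconstruction target: in MILAN these features come from an image encoder
# pretrained with caption supervision. Random values stand in here.
target_feats = rng.normal(size=(num_patches, dim))

# Semantic-aware mask sampling (sketch): patches with higher attention scores
# are more likely to be kept visible; the rest are masked for reconstruction.
# The attention scores are simulated here.
attn_scores = rng.random(num_patches)
probs = attn_scores / attn_scores.sum()
num_visible = int(num_patches * (1 - mask_ratio))
visible_idx = rng.choice(num_patches, size=num_visible, replace=False, p=probs)
mask = np.ones(num_patches, dtype=bool)
mask[visible_idx] = False  # True = masked, i.e. to be reconstructed

# Stand-in for the decoder's predicted features (a real model produces these
# from the visible patches plus mask tokens).
pred_feats = target_feats + 0.1 * rng.normal(size=(num_patches, dim))

# Pretraining loss: mean squared error between predicted and target features,
# computed on the masked patches only.
loss = float(np.mean((pred_feats[mask] - target_feats[mask]) ** 2))
print(f"masked patches: {mask.sum()}, loss: {loss:.4f}")
```

The key difference from a pixel-reconstruction masked autoencoder is only the target tensor: swapping `target_feats` from normalized pixel patches to caption-supervised features injects semantic signal into the objective without changing the loss itself.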