Paper Title
How to Train Vision Transformer on Small-scale Datasets?
Paper Authors
Paper Abstract
Vision Transformer (ViT), a radically different architecture from convolutional neural networks, offers multiple advantages including design simplicity, robustness, and state-of-the-art performance on many vision tasks. However, in contrast to convolutional neural networks, the Vision Transformer lacks inherent inductive biases. Therefore, successful training of such models is mainly attributed to pre-training on large-scale datasets such as ImageNet with 1.2M images or JFT with 300M images. This hinders the direct adaptation of Vision Transformers to small-scale datasets. In this work, we show that self-supervised inductive biases can be learned directly from small-scale datasets and serve as an effective weight initialization scheme for fine-tuning. This allows these models to be trained without large-scale pre-training, changes to the model architecture, or changes to the loss function. We present thorough experiments that successfully train monolithic and non-monolithic Vision Transformers on five small datasets, including CIFAR10/100, CINIC10, SVHN, and Tiny-ImageNet, as well as two fine-grained datasets: Aircraft and Cars. Our approach consistently improves the performance of Vision Transformers while retaining their properties, such as attention to salient regions and higher robustness. Our code and pre-trained models are available at: https://github.com/hananshafi/vits-for-small-scale-datasets.
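Conceptually, the pipeline described in the abstract is two-stage: self-supervised pre-training on the small target dataset itself, followed by supervised fine-tuning that starts from those pre-trained weights. The sketch below illustrates that two-stage idea in PyTorch, under stated assumptions: the rotation-prediction pretext task, the torchvision ViT-B/16 backbone, CIFAR10, and all hyperparameters are illustrative stand-ins, not the paper's actual objective or settings (those are in the repository linked above).

```python
# Illustrative sketch only: a 4-way rotation-prediction pretext task stands in for
# the paper's self-supervised objective; model, dataset and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torchvision
from torch.utils.data import DataLoader
from torchvision import transforms

def build_vit(num_classes):
    # Small-resolution inputs are resized to 224 to fit torchvision's ViT-B/16.
    return torchvision.models.vit_b_16(weights=None, num_classes=num_classes)

transform = transforms.Compose([transforms.Resize(224), transforms.ToTensor()])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                          download=True, transform=transform)
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=2)
device = "cuda" if torch.cuda.is_available() else "cpu"
criterion = nn.CrossEntropyLoss()

# --- Stage 1: self-supervised pre-training on the small dataset (labels ignored) ---
ssl_model = build_vit(num_classes=4).to(device)
opt = torch.optim.AdamW(ssl_model.parameters(), lr=3e-4)
for images, _ in loader:  # single pass shown for brevity
    images = images.to(device)
    k = torch.randint(0, 4, (images.size(0),), device=device)
    rotated = torch.stack([torch.rot90(img, int(r), dims=(1, 2))
                           for img, r in zip(images, k)])
    loss = criterion(ssl_model(rotated), k)
    opt.zero_grad(); loss.backward(); opt.step()

# --- Stage 2: use the self-supervised weights to initialize supervised fine-tuning ---
model = build_vit(num_classes=10).to(device)
state = {k: v for k, v in ssl_model.state_dict().items() if not k.startswith("heads")}
model.load_state_dict(state, strict=False)  # copy everything except the pretext head
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
for images, labels in loader:
    images, labels = images.to(device), labels.to(device)
    loss = criterion(model(images), labels)
    opt.zero_grad(); loss.backward(); opt.step()
```

The point of the sketch is only the weight-initialization step in Stage 2: the backbone weights learned without labels on the small dataset replace random initialization before fine-tuning, which is the mechanism the abstract attributes the performance gains to.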