Paper Title
Rethinking Hierarchies in Pre-trained Plain Vision Transformer
Authors
Abstract
Self-supervised pre-training of vision transformers (ViTs) via masked image modeling (MIM) has proven very effective. However, hierarchical ViTs require carefully designed, customized algorithms such as GreenMIM, and cannot simply use the vanilla MAE that works for the plain ViT. More importantly, since these hierarchical ViTs cannot reuse the off-the-shelf pre-trained weights of plain ViTs, they have to be pre-trained themselves at massive computational cost, which adds computational complexity on top of the algorithmic complexity. In this paper, we address this problem by proposing a novel idea: disentangling the hierarchical architecture design from the self-supervised pre-training. We transform the plain ViT into a hierarchical one with minimal changes. Technically, we change the stride of the linear embedding layer from 16 to 4 and add convolution (or simple average) pooling layers between the transformer blocks, thereby reducing the feature size from 1/4 to 1/32 sequentially. Despite its simplicity, it outperforms the plain ViT baseline in classification, detection, and segmentation tasks on the ImageNet, MS COCO, Cityscapes, and ADE20K benchmarks. We hope this preliminary study will draw more attention from the community to developing effective (hierarchical) ViTs while avoiding the pre-training cost by leveraging off-the-shelf checkpoints. The code and models will be released at https://github.com/ViTAE-Transformer/HPViT.
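The resolution schedule described in the abstract can be illustrated with a little arithmetic. The sketch below is not the authors' implementation. It assumes a 224x224 input and three pooling layers that each halve the spatial resolution, which is one way to go from 1/4 to 1/32; the abstract does not state the number of pooling layers or their stride. The function name stage_resolutions and all of its parameters are hypothetical.

```python
def stage_resolutions(image_size=224, embed_stride=4, pool_stride=2, num_pools=3):
    """Return (downsampling factor, side length, token count) for each stage.

    Assumptions (not stated in the abstract): a square input and
    `num_pools` pooling layers that each reduce the side by `pool_stride`.
    """
    stages = []
    factor = embed_stride  # the embedding layer uses stride 4 instead of 16
    for _ in range(num_pools + 1):
        if image_size % factor != 0:
            raise ValueError(f"image_size {image_size} is not divisible by {factor}")
        side = image_size // factor
        stages.append((factor, side, side * side))
        factor *= pool_stride  # a pooling layer between transformer blocks
    return stages


if __name__ == "__main__":
    # Plain ViT reference point: stride 16 throughout.
    plain_side = 224 // 16
    print(f"plain ViT, 1/16: {plain_side}x{plain_side}, {plain_side ** 2} tokens")
    for factor, side, tokens in stage_resolutions():
        print(f"hierarchical, 1/{factor}: {side}x{side}, {tokens} tokens")
```

Under these assumptions the token count shrinks by a factor of four at each pooling layer, so the early stages operate on many more tokens than a plain ViT with stride 16.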