Paper Title
ResFormer: Scaling ViTs with Multi-Resolution Training
Paper Authors
Paper Abstract
Vision Transformers (ViTs) have achieved overwhelming success, yet they suffer from vulnerable resolution scalability, i.e., performance drops drastically when they are presented with input resolutions unseen during training. We introduce ResFormer, a framework built upon the seminal idea of multi-resolution training to improve performance across a wide spectrum of, mostly unseen, testing resolutions. In particular, ResFormer operates on replicated images of different resolutions and enforces a scale consistency loss to engage interactive information across different scales. More importantly, to alternate among varying resolutions effectively, especially novel ones at test time, we propose a global-local positional embedding strategy that changes smoothly conditioned on input sizes. We conduct extensive experiments on ImageNet image classification. The results provide strong quantitative evidence that ResFormer has promising scaling abilities towards a wide range of resolutions. For instance, ResFormer-B-MR achieves Top-1 accuracies of 75.86% and 81.72% when evaluated on relatively low and high resolutions, respectively (i.e., 96 and 640), which are 48% and 7.49% better than DeiT-B. Moreover, we demonstrate that ResFormer is flexible and can be easily extended to semantic segmentation, object detection, and video action recognition. Code is available at https://github.com/ruitian12/resformer.
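The abstract does not give the exact form of the scale consistency loss, so the following is only a minimal sketch under a common assumption: class predictions from a low-resolution replica of an image are pulled toward those from its high-resolution replica via a KL divergence between softmax outputs. Function names and the direction of the KL term are illustrative choices, not the paper's definition.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scale_consistency_loss(logits_lo, logits_hi):
    """KL(p_hi || p_lo), averaged over the batch: encourages the
    low-resolution prediction to match the (typically stronger)
    high-resolution one. Both inputs are (batch, num_classes) logits."""
    p_hi = softmax(logits_hi)
    p_lo = softmax(logits_lo)
    kl = np.sum(p_hi * (np.log(p_hi) - np.log(p_lo)), axis=-1)
    return float(kl.mean())
```

In an actual training loop this term would be added, with some weight, to the ordinary classification losses computed on each replica; the abstract leaves that weighting unspecified.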