Paper Title
VSA: Learning Varied-Size Window Attention in Vision Transformers
Paper Authors
Paper Abstract
Attention within windows has been widely explored in vision transformers to balance performance, computation complexity, and memory footprint. However, current models adopt a hand-crafted, fixed-size window design, which restricts their capacity to model long-term dependencies and adapt to objects of different sizes. To address this drawback, we propose \textbf{V}aried-\textbf{S}ize Window \textbf{A}ttention (VSA) to learn adaptive window configurations from data. Specifically, based on the tokens within each default window, VSA employs a window regression module to predict the size and location of the target window, i.e., the attention area from which the key and value tokens are sampled. By adopting VSA independently for each attention head, the model can capture long-term dependencies, gather rich context from diverse windows, and promote information exchange among overlapping windows. VSA is an easy-to-implement module that can replace the window attention in state-of-the-art representative models with minor modifications and negligible extra computational cost, while improving their performance by a large margin, e.g., 1.1\% for Swin-T on ImageNet classification. Moreover, the performance gain increases when larger images are used for training and testing. Experimental results on further downstream tasks, including object detection, instance segmentation, and semantic segmentation, demonstrate the superiority of VSA over vanilla window attention in dealing with objects of different sizes. The code will be released at https://github.com/ViTAE-Transformer/ViTAE-VSA.
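To make the mechanism concrete, below is a minimal PyTorch sketch of how a VSA layer could work, written from the abstract's description alone: a window regression module pools each default window and predicts a per-head scale and offset, and the key/value tokens are then bilinearly resampled from the resulting target windows before standard window attention. All identifiers (e.g., `VariedSizeWindowAttention`, `window_reg`), the exact parameterization of the transform, and the choice to measure offsets in units of the window size are illustrative assumptions, not the authors' implementation; see the repository linked above for the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VariedSizeWindowAttention(nn.Module):
    """Sketch of VSA: for each default window, a regression head predicts,
    per attention head, the scale and offset of a target window from which
    key/value tokens are bilinearly resampled before window attention."""

    def __init__(self, dim, num_heads=4, window_size=7):
        super().__init__()
        assert dim % num_heads == 0
        self.nh, self.ws, self.hd = num_heads, window_size, dim // num_heads
        self.scale = self.hd ** -0.5
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        # Window regression module (name assumed): pool each default window,
        # then predict per-head (scale_x, scale_y, offset_x, offset_y).
        self.window_reg = nn.Sequential(
            nn.AvgPool2d(window_size, stride=window_size),
            nn.LeakyReLU(inplace=True),
            nn.Conv2d(dim, num_heads * 4, kernel_size=1),
        )
        # Zero-init so every target window starts as its default window.
        nn.init.zeros_(self.window_reg[-1].weight)
        nn.init.zeros_(self.window_reg[-1].bias)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):  # x: (B, C, H, W); H, W divisible by window_size
        B, C, H, W = x.shape
        nh, ws, hd = self.nh, self.ws, self.hd
        nWh, nWw = H // ws, W // ws
        q, k, v = self.qkv(x).chunk(3, dim=1)

        # Per-window, per-head transform; zero output keeps the default window.
        p = self.window_reg(x).view(B, nh, 4, nWh, nWw)
        sx, sy = 1.0 + p[:, :, 0], 1.0 + p[:, :, 1]
        ox, oy = p[:, :, 2] * ws, p[:, :, 3] * ws  # offsets in pixels (assumed unit)

        # Token-center offsets inside a window and default window centers.
        d = torch.arange(ws, device=x.device, dtype=x.dtype) - (ws - 1) / 2
        dy, dx = torch.meshgrid(d, d, indexing="ij")
        cy = torch.arange(nWh, device=x.device, dtype=x.dtype) * ws + (ws - 1) / 2
        cx = torch.arange(nWw, device=x.device, dtype=x.dtype) * ws + (ws - 1) / 2

        # Sampling locations (B, nh, nWh, nWw, ws, ws), then a normalized
        # grid (B*nh, H, W, 2) for grid_sample (align_corners=False convention).
        ys = cy.view(1, 1, nWh, 1, 1, 1) + sy[..., None, None] * dy + oy[..., None, None]
        xs = cx.view(1, 1, 1, nWw, 1, 1) + sx[..., None, None] * dx + ox[..., None, None]
        grid = torch.stack([(2 * xs + 1) / W - 1, (2 * ys + 1) / H - 1], dim=-1)
        grid = grid.permute(0, 1, 2, 4, 3, 5, 6).reshape(B * nh, H, W, 2)

        # Resample keys/values from the varied-size windows, per head.
        k = F.grid_sample(k.reshape(B * nh, hd, H, W), grid, align_corners=False)
        v = F.grid_sample(v.reshape(B * nh, hd, H, W), grid, align_corners=False)

        def to_windows(t):  # (B*nh, hd, H, W) -> (B*nh*nW, ws*ws, hd)
            t = t.reshape(B * nh, hd, nWh, ws, nWw, ws)
            return t.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, hd)

        # Queries come from the default windows; keys/values from the
        # resampled (varied-size) windows.
        qw = to_windows(q.reshape(B * nh, hd, H, W))
        kw, vw = to_windows(k), to_windows(v)
        attn = ((qw * self.scale) @ kw.transpose(-2, -1)).softmax(dim=-1)
        out = (attn @ vw).view(B * nh, nWh, nWw, ws, ws, hd)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return self.proj(out)


# Smoke test with Swin-T stage-1 shapes.
vsa = VariedSizeWindowAttention(dim=96, num_heads=3, window_size=7)
print(vsa(torch.randn(2, 96, 56, 56)).shape)  # torch.Size([2, 96, 56, 56])
```

Note the zero-initialized regression head: at the start of training every target window coincides with its default window, so the module initially behaves like vanilla window attention and only gradually learns to scale and shift the attention areas.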