Paper Title

When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism

Authors

Guangting Wang, Yucheng Zhao, Chuanxin Tang, Chong Luo, Wenjun Zeng

Abstract

The attention mechanism is widely believed to be the key to the success of vision transformers (ViTs), since it provides a flexible and powerful way to model spatial relationships. However, is the attention mechanism truly an indispensable part of ViT? Can it be replaced by some other alternative? To demystify the role of the attention mechanism, we simplify it to an extremely simple case: ZERO FLOPs and ZERO parameters. Concretely, we revisit the shift operation. It contains no parameters and performs no arithmetic computation; the only operation is to exchange a small portion of the channels between neighboring features. Based on this simple operation, we construct a new backbone network, namely ShiftViT, in which the attention layers of ViT are substituted by shift operations. Surprisingly, ShiftViT works quite well on several mainstream tasks, e.g., classification, detection, and segmentation. Its performance is on par with or even better than the strong baseline Swin Transformer. These results suggest that the attention mechanism might not be the vital factor that makes ViT successful; it can even be replaced by a zero-parameter operation. We should pay more attention to the remaining parts of ViT in future work. Code is available at github.com/microsoft/SPACH.
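The shift operation described in the abstract is simple enough to sketch in a few lines. Below is a minimal PyTorch sketch of such a zero-parameter shift; the function name `shift_feature` and the default ratio of 1/12 of the channels per direction are illustrative assumptions, not taken from the paper's released code (see github.com/microsoft/SPACH for the authors' implementation).

```python
import torch

def shift_feature(x: torch.Tensor, div: int = 12) -> torch.Tensor:
    """Zero-parameter shift: move a small slice of channels toward each
    of the four spatial neighbors; leave the remaining channels as-is.

    x:   feature map of shape (B, C, H, W)
    div: C // div channels are shifted per direction, so 4/div of the
         channels move in total (an assumed ratio, here 1/3 for div=12).
    """
    B, C, H, W = x.shape
    g = C // div
    out = torch.zeros_like(x)
    out[:, 0*g:1*g, :, :W-1] = x[:, 0*g:1*g, :, 1:]    # shift left
    out[:, 1*g:2*g, :, 1:]   = x[:, 1*g:2*g, :, :W-1]  # shift right
    out[:, 2*g:3*g, :H-1, :] = x[:, 2*g:3*g, 1:, :]    # shift up
    out[:, 3*g:4*g, 1:, :]   = x[:, 3*g:4*g, :H-1, :]  # shift down
    out[:, 4*g:, :, :]       = x[:, 4*g:, :, :]        # untouched channels
    return out

# Quick check: output has the same shape, and no learned parameters
# or arithmetic are involved beyond memory copies.
x = torch.randn(1, 96, 56, 56)
y = shift_feature(x)
assert y.shape == x.shape
```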
