Paper Title
UniFormer: Unifying Convolution and Self-attention for Visual Recognition
Paper Authors
Paper Abstract
It is a challenging task to learn discriminative representations from images and videos, due to the large local redundancy and complex global dependency in such visual data. Convolutional neural networks (CNNs) and vision transformers (ViTs) have been the two dominant frameworks in the past few years. Though CNNs can efficiently decrease local redundancy by convolution within a small neighborhood, their limited receptive field makes it hard to capture global dependency. Alternatively, ViTs can effectively capture long-range dependency via self-attention, but blind similarity comparisons among all the tokens lead to high redundancy. To resolve these problems, we propose a novel Unified transFormer (UniFormer), which seamlessly integrates the merits of convolution and self-attention in a concise transformer format. Unlike typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local and global token affinity in shallow and deep layers respectively, allowing it to tackle both redundancy and dependency for efficient and effective representation learning. Finally, we flexibly stack our UniFormer blocks into a new powerful backbone and adopt it for various vision tasks, from image to video domain and from classification to dense prediction. Without any extra training data, our UniFormer achieves 86.3 top-1 accuracy on ImageNet-1K classification. With only ImageNet-1K pre-training, it achieves state-of-the-art performance in a broad range of downstream tasks, e.g., 82.9/84.8 top-1 accuracy on Kinetics-400/600, 60.9/71.2 top-1 accuracy on Sth-Sth V1/V2 video classification, 53.8 box AP and 46.4 mask AP on COCO object detection, 50.8 mIoU on ADE20K semantic segmentation, and 77.4 AP on COCO pose estimation. We further build an efficient UniFormer with 2-4x higher throughput. Code is available at https://github.com/Sense-X/UniFormer.
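The following is a minimal PyTorch sketch of the block design described in the abstract: a relation aggregator that uses local token affinity (here a depthwise convolution over a small neighborhood) in shallow layers and global token affinity (standard self-attention over all tokens) in deep layers, wrapped in a transformer-style residual format. All class names, layer choices, and hyperparameters below are illustrative assumptions, not the paper's official implementation; see the linked repository for the real code.

```python
# Illustrative UniFormer-style block: local vs. global relation aggregation.
# Assumed design choices (not from the official code): depthwise conv for local
# affinity, nn.MultiheadAttention for global affinity, BatchNorm + 1x1-conv FFN.
import torch
import torch.nn as nn


class LocalAggregator(nn.Module):
    """Local relation aggregator: token affinity restricted to a small
    neighborhood, implemented as a depthwise convolution on the 2D token grid."""

    def __init__(self, dim, kernel_size=5):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x):  # x: (B, C, H, W)
        return self.dwconv(x)


class GlobalAggregator(nn.Module):
    """Global relation aggregator: multi-head self-attention over all tokens."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)  # all-pair token affinity
        return out.transpose(1, 2).reshape(B, C, H, W)


class UniFormerBlock(nn.Module):
    """Transformer-format block: position encoding -> relation aggregator -> FFN,
    each with a residual connection. Use `use_global=False` in shallow stages
    and `use_global=True` in deep stages."""

    def __init__(self, dim, use_global, mlp_ratio=4):
        super().__init__()
        # Conditional position encoding via depthwise conv (an assumed choice).
        self.pos = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.norm1 = nn.BatchNorm2d(dim)
        self.agg = GlobalAggregator(dim) if use_global else LocalAggregator(dim)
        self.norm2 = nn.BatchNorm2d(dim)
        hidden = dim * mlp_ratio
        self.ffn = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.GELU(), nn.Conv2d(hidden, dim, 1))

    def forward(self, x):  # x: (B, C, H, W)
        x = x + self.pos(x)
        x = x + self.agg(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x


if __name__ == "__main__":
    shallow = UniFormerBlock(64, use_global=False)    # early stage: local affinity
    deep = UniFormerBlock(64, use_global=True)        # later stage: global affinity
    print(shallow(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
    print(deep(torch.randn(1, 64, 14, 14)).shape)     # torch.Size([1, 64, 14, 14])
```

The split mirrors the abstract's motivation: cheap neighborhood aggregation suppresses local redundancy at high resolutions, while full self-attention is reserved for deeper, lower-resolution stages where capturing long-range dependency is affordable.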