Paper Title

Focal Modulation Networks

Paper Authors

Jianwei Yang, Chunyuan Li, Xiyang Dai, Lu Yuan, Jianfeng Gao

Paper Abstract

We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show that FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) at similar computational cost on image classification, object detection, and segmentation. Specifically, FocalNets of tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretraining on ImageNet-22K at 224 resolution, they attain 86.5% and 87.3% top-1 accuracy when finetuned at resolution 224 and 384, respectively. When transferred to downstream tasks, FocalNets exhibit a clear advantage. For object detection with Mask R-CNN, FocalNet base trained with the 1× schedule outperforms the Swin counterpart by 2.1 points and already surpasses Swin trained with the 3× schedule (49.0 vs. 48.5). For semantic segmentation with UPerNet, FocalNet base at single scale outperforms Swin by 2.4 and beats Swin at multi-scale (50.5 vs. 49.7). Using large FocalNet and Mask2Former, we achieve 58.5 mIoU for ADE20K semantic segmentation and 57.9 PQ for COCO panoptic segmentation. Using huge FocalNet and DINO, we achieve 64.3 and 64.4 mAP on COCO minival and test-dev, respectively, establishing a new SoTA on top of much larger attention-based models such as SwinV2-G and BEiT-3. Code and checkpoints are available at https://github.com/microsoft/FocalNet.
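To make the three components concrete, below is a minimal PyTorch sketch of a focal modulation block following the abstract's description. The class name, the number of focal levels, the kernel-size schedule, and the single input projection that produces the query, context, and gates are illustrative assumptions, not the official implementation; refer to the repository linked above for the authors' actual code.

```python
import torch
import torch.nn as nn


class FocalModulation(nn.Module):
    """Minimal sketch of focal modulation for a (B, C, H, W) feature map.

    Hyper-parameters (focal levels, kernel sizes, projection layout) are
    illustrative assumptions; see the official FocalNet repo for the real ones.
    """

    def __init__(self, dim, focal_levels=3, base_kernel=3):
        super().__init__()
        self.focal_levels = focal_levels
        # One projection producing the query, the context, and per-level spatial gates.
        self.proj_in = nn.Conv2d(dim, 2 * dim + (focal_levels + 1), kernel_size=1)
        # (i) Hierarchical contextualization: a stack of depth-wise convolutions
        # with growing kernel sizes, covering short- to long-range context.
        self.context_layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=base_kernel + 2 * l,
                          padding=(base_kernel + 2 * l) // 2, groups=dim, bias=False),
                nn.GELU(),
            )
            for l in range(focal_levels)
        ])
        # Maps the aggregated context to a modulator before injecting it into the query.
        self.to_modulator = nn.Conv2d(dim, dim, kernel_size=1)
        self.proj_out = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        q, ctx, gates = torch.split(
            self.proj_in(x), [x.shape[1], x.shape[1], self.focal_levels + 1], dim=1)

        # (ii) Gated aggregation: each context level is weighted by a spatial gate
        # predicted from the input; a globally pooled level captures the full image.
        aggregated = 0
        for l, layer in enumerate(self.context_layers):
            ctx = layer(ctx)
            aggregated = aggregated + ctx * gates[:, l:l + 1]
        global_ctx = ctx.mean(dim=(2, 3), keepdim=True)
        aggregated = aggregated + global_ctx * gates[:, self.focal_levels:]

        # (iii) Element-wise modulation: inject the aggregated context into the query.
        return self.proj_out(q * self.to_modulator(aggregated))


if __name__ == "__main__":
    block = FocalModulation(dim=96)
    out = block(torch.randn(2, 96, 56, 56))
    print(out.shape)  # torch.Size([2, 96, 56, 56])
```

In this sketch the block is a drop-in replacement for a self-attention layer: it keeps the input resolution and channel count, so it could sit inside a standard Transformer-style block (norm, token mixer, MLP) in place of SA.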
