Vision Transformer with Convolutions Architecture Search

Paper Title

Vision Transformer with Convolutions Architecture Search

Authors

Zhang, Haichao; Hao, Kuangrong; Pedrycz, Witold; Gao, Lei; Tang, Xuesong; Wei, Bing

Abstract

Transformers exhibit great advantages in handling computer vision tasks. They model image classification by using a multi-head attention mechanism to process a sequence of patches obtained by splitting the image. For complex tasks, however, a Transformer in computer vision needs not only to inherit dynamic attention and global context, but also to introduce features that provide noise reduction and invariance to the shifting and scaling of objects. We therefore take a step forward by studying the structural characteristics of Transformers and convolutions, and propose an architecture search method: Vision Transformer with Convolutions Architecture Search (VTCAS). The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture while maintaining the benefits of the multi-head attention mechanism. The searched block-based backbone network can extract feature maps at different scales. These features are compatible with a wider range of visual tasks, such as image classification (32 M parameters, 82.0% Top-1 accuracy on ImageNet-1K) and object detection (50.4% mAP on COCO2017). The proposed topology, based on the multi-head attention mechanism and CNN, adaptively associates relational features of pixels with multi-scale features of objects. It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
