Paper Title

Toward Transformer-Based Object Detection

Authors

Josh Beal, Eric Kim, Eric Tzeng, Dong Huk Park, Andrew Zhai, Dmitry Kislyuk

Abstract


Transformers have become the dominant model in natural language processing, owing to their ability to pretrain on massive amounts of data, then transfer to smaller, more specific tasks via fine-tuning. The Vision Transformer was the first major attempt to apply a pure transformer model directly to images as input, demonstrating that as compared to convolutional networks, transformer-based architectures can achieve competitive results on benchmark classification tasks. However, the computational complexity of the attention operator means that we are limited to low-resolution inputs. For more complex tasks such as detection or segmentation, maintaining a high input resolution is crucial to ensure that models can properly identify and reflect fine details in their output. This naturally raises the question of whether or not transformer-based architectures such as the Vision Transformer are capable of performing tasks other than classification. In this paper, we determine that Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results. The model that we propose, ViT-FRCNN, demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance. We also investigate improvements over a standard detection backbone, including superior performance on out-of-domain images, better performance on large objects, and a lessened reliance on non-maximum suppression. We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
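The abstract's point about attention limiting input resolution can be made concrete with a back-of-the-envelope sketch. The function below is illustrative only: the patch size of 16 follows the Vision Transformer design, but the example resolutions are arbitrary assumptions, not settings reported in the paper.

```python
# Sketch: why self-attention constrains input resolution for ViT-style models.
# A ViT splits an image into (image_size // patch_size)**2 patches, and
# self-attention compares every patch with every other patch, so the cost of
# the attention operator grows quadratically in the number of patches
# (quartically in the image side length).

def attention_cost(image_size: int, patch_size: int = 16) -> int:
    """Pairwise attention scores per head for a square image (illustrative)."""
    num_patches = (image_size // patch_size) ** 2
    return num_patches ** 2

# Doubling the resolution from 224 to 448 multiplies the cost by 16:
print(attention_cost(224))  # 196 patches -> 38,416 scores
print(attention_cost(448))  # 784 patches -> 614,656 scores
```

This quartic growth in the side length is why classification-scale resolutions are tractable while the high resolutions needed for detection and segmentation become expensive, motivating the question the paper poses.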
