Paper Title

Boosting vision transformers for image retrieval

Paper Authors

Chull Hwan Song, Jooyoung Yoon, Shunghyun Choi, Yannis Avrithis

Paper Abstract

Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection. However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional networks. We propose a number of improvements that make transformers outperform the state of the art for the first time. (1) We show that a hybrid architecture is more effective than plain transformers, by a large margin. (2) We introduce two branches collecting global (classification token) and local (patch tokens) information, from which we form a global image representation. (3) In each branch, we collect multi-layer features from the transformer encoder, corresponding to skip connections across distant layers. (4) We enhance locality of interactions at the deeper layers of the encoder, which is the relative weakness of vision transformers. We train our model on all commonly used training sets and, for the first time, we make fair comparisons separately per training set. In all cases, we outperform previous models based on global representation. Public code is available at https://github.com/dealicious-inc/DToP.
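
To make points (2) and (3) concrete, below is a minimal PyTorch sketch of the two-branch idea: collecting the global (CLS) token and pooled local (patch) tokens from several encoder layers, then fusing them into one global descriptor. The class name, the linear projections, the mean pooling, and the sum fusion are all assumptions made for this illustration, not the paper's actual design; refer to the linked DToP repository for the real implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchHead(nn.Module):
    """Hypothetical two-branch aggregation head (illustration only)."""

    def __init__(self, dim: int, num_layers: int, out_dim: int = 512):
        super().__init__()
        # one projection per branch over the concatenated multi-layer features
        self.global_proj = nn.Linear(dim * num_layers, out_dim)
        self.local_proj = nn.Linear(dim * num_layers, out_dim)

    def forward(self, layer_tokens: list[torch.Tensor]) -> torch.Tensor:
        # layer_tokens: outputs of selected encoder layers, each of shape
        # (batch, 1 + num_patches, dim); token 0 is the CLS token
        cls_feats = torch.cat([t[:, 0] for t in layer_tokens], dim=-1)
        # mean-pool patch tokens per layer (a stand-in for the paper's pooling)
        patch_feats = torch.cat([t[:, 1:].mean(dim=1) for t in layer_tokens], dim=-1)
        fused = self.global_proj(cls_feats) + self.local_proj(patch_feats)
        return F.normalize(fused, dim=-1)  # L2-normalize for cosine retrieval
```

Given a ViT backbone that exposes intermediate layer outputs (e.g., via forward hooks), the resulting descriptor can be compared by cosine similarity for nearest-neighbor retrieval, which is the global-representation setting the abstract's comparisons refer to.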
