Paper Title

Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning

Paper Authors

Yuzhong Chen, Zhenxiang Xiao, Lin Zhao, Lu Zhang, Haixing Dai, David Weizhong Liu, Zihao Wu, Changhe Li, Tuo Zhang, Changying Li, Dajiang Zhu, Tianming Liu, Xi Jiang

Abstract

Learning with little data is challenging but often inevitable in application scenarios where labeled data is limited and costly. Recently, few-shot learning (FSL) has gained increasing attention because it generalizes prior knowledge to new tasks that contain only a few samples. However, for data-intensive models such as the vision transformer (ViT), current fine-tuning-based FSL approaches are inefficient at knowledge generalization and thus degrade downstream task performance. In this paper, we propose a novel mask-guided vision transformer (MG-ViT) to achieve effective and efficient FSL on the ViT model. The key idea is to apply a mask to the image patches to screen out task-irrelevant ones and to guide the ViT to focus on task-relevant, discriminative patches during FSL. In particular, MG-ViT introduces only an additional mask operation and a residual connection, enabling it to inherit parameters from a pre-trained ViT without any other cost. To optimally select representative few-shot samples, we also include an active-learning-based sample selection method to further improve the generalizability of MG-ViT-based FSL. We evaluate the proposed MG-ViT on both the Agri-ImageNet classification task and the ACFR apple detection task, with gradient-weighted class activation mapping (Grad-CAM) as the mask. The experimental results show that MG-ViT significantly improves performance compared with general fine-tuning-based ViT models, providing novel insights and a concrete approach toward generalizing data-intensive, large-scale deep learning models for FSL.
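The core mechanism described above — ranking image patches by a Grad-CAM saliency map and masking out the task-irrelevant ones before they reach the transformer — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `mask_patches`, the `keep_ratio` parameter, and the list-based token representation are all hypothetical simplifications, and the real model operates on patch embeddings inside a ViT with a residual connection around the masked branch.

```python
def mask_patches(patch_tokens, cam, keep_ratio=0.5):
    """Keep only the most salient patch tokens, ranked by a Grad-CAM map.

    patch_tokens: per-patch payloads (e.g. embeddings), one per image patch.
    cam: Grad-CAM saliency score for each patch (same length).
    keep_ratio: fraction of patches to retain (hypothetical knob).
    Returns (kept_tokens, mask), where mask[i] is True for retained patches.
    """
    n_keep = max(1, int(len(cam) * keep_ratio))
    # Rank patch indices by descending saliency and keep the top-k;
    # the remaining patches are "screened out" as task-irrelevant.
    ranked = sorted(range(len(cam)), key=lambda i: cam[i], reverse=True)
    keep = set(ranked[:n_keep])
    mask = [i in keep for i in range(len(cam))]
    kept = [tok for i, tok in enumerate(patch_tokens) if mask[i]]
    return kept, mask


# Toy usage: 8 patches with made-up saliency scores.
tokens = [f"patch{i}" for i in range(8)]
cam = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.4]
kept, mask = mask_patches(tokens, cam, keep_ratio=0.5)
print(kept)  # the 4 most salient patches, in original order
```

Because the masked tokens are simply dropped (rather than re-learned), the surviving parameters can be taken directly from a pre-trained ViT, which is how the paper claims parameter inheritance "without any other cost."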
