Paper Title
Mask-guided Vision Transformer (MG-ViT) for Few-Shot Learning
Paper Authors
Paper Abstract
Learning with little data is challenging but often inevitable in application scenarios where labeled data are limited and costly. Recently, few-shot learning (FSL) has gained increasing attention because it generalizes prior knowledge to new tasks that contain only a few samples. However, for data-intensive models such as the vision transformer (ViT), current fine-tuning based FSL approaches generalize knowledge inefficiently and thus degrade downstream task performance. In this paper, we propose a novel mask-guided vision transformer (MG-ViT) to achieve effective and efficient FSL on ViT models. The key idea is to apply a mask to the image patches to screen out task-irrelevant ones and to guide the ViT to focus on task-relevant, discriminative patches during FSL. In particular, MG-ViT introduces only an additional mask operation and a residual connection, so it inherits the parameters of a pre-trained ViT at no extra cost. To select representative few-shot samples, we also include an active learning based sample selection method that further improves the generalizability of MG-ViT based FSL. We evaluate the proposed MG-ViT on both the Agri-ImageNet classification task and the ACFR apple detection task, using gradient-weighted class activation mapping (Grad-CAM) to generate the mask. Experimental results show that MG-ViT significantly outperforms general fine-tuning based ViT models, providing novel insights and a concrete approach towards generalizing data-intensive, large-scale deep learning models for FSL.
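The mask operation described in the abstract can be illustrated with a short sketch. The following is a minimal example (not the authors' released code), assuming a Grad-CAM saliency map over the input image is already available: the map is average-pooled per patch, and only the most salient patch tokens are kept before they would be passed to the ViT encoder. All names here (patch_saliency, mask_guided_tokens, keep_ratio) are illustrative assumptions, and the residual connection mentioned in the abstract is omitted for brevity.

```python
# Minimal sketch of mask-guided patch screening, assuming a
# Grad-CAM map is already computed. Illustrative only; this is
# not the authors' implementation of MG-ViT.
import torch


def patch_saliency(cam: torch.Tensor, patch: int) -> torch.Tensor:
    """Average a (H, W) saliency map over non-overlapping patches,
    returning one saliency score per patch, flattened to (N,)."""
    H, W = cam.shape
    cam = cam.reshape(H // patch, patch, W // patch, patch)
    return cam.mean(dim=(1, 3)).flatten()


def mask_guided_tokens(tokens: torch.Tensor, saliency: torch.Tensor,
                       keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep only the top-`keep_ratio` most salient patch tokens.
    tokens: (B, N, D) patch embeddings without the class token."""
    k = max(1, int(saliency.numel() * keep_ratio))
    idx = saliency.topk(k).indices        # task-relevant patch indices
    return tokens[:, idx, :]              # screened token set


# Toy usage: a 224x224 image as a 14x14 grid of 16x16 patches, D = 192.
tokens = torch.randn(2, 196, 192)         # (B, N, D) patch embeddings
cam = torch.rand(224, 224)                # stand-in for a Grad-CAM map
sal = patch_saliency(cam, patch=16)       # (196,) per-patch saliency
kept = mask_guided_tokens(tokens, sal, keep_ratio=0.5)
print(kept.shape)                         # torch.Size([2, 98, 192])
```

Note that the top-k selection above is only a stand-in for the paper's mask operation: per the abstract, MG-ViT applies a mask together with a residual connection rather than literally discarding tokens, which is what lets it inherit the pre-trained ViT's parameters unchanged.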