Paper Title
Visual-Semantic Contrastive Alignment for Few-Shot Image Classification
Paper Authors
Paper Abstract
Few-shot learning aims to train and optimize a model that can adapt to unseen visual classes with only a few labeled examples. Existing few-shot learning (FSL) methods rely heavily on visual data alone and thus fail to capture the semantic attributes needed to learn a more generalized version of a visual concept from very few examples. However, it is well known that human visual learning benefits immensely from multiple modalities such as vision, language, and audio. Inspired by how humans draw on existing knowledge of a visual category encoded in language, we introduce a contrastive alignment mechanism for visual and semantic feature vectors to learn more generalized visual concepts for few-shot learning. Our method simply adds an auxiliary contrastive learning objective, alongside the existing training mechanism, that captures the contextual knowledge of a visual category from a strong textual encoder. Hence, the approach is generic and can be plugged into any existing FSL method. The pre-trained semantic feature extractor we use (learned from large-scale text corpora) provides strong contextual prior knowledge to assist FSL. Experimental results on popular FSL datasets show that our approach is generic in nature and provides a strong boost to existing FSL baselines.
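The auxiliary objective described above can be sketched as an InfoNCE-style contrastive loss between visual embeddings and the corresponding class-level text embeddings. This is a minimal illustration, not the paper's exact formulation: the function name, the temperature value, and the assumption that row `i` of each matrix forms a matched visual-semantic pair are all hypothetical choices for the sketch.

```python
import numpy as np

def contrastive_alignment_loss(visual, semantic, temperature=0.1):
    """Illustrative InfoNCE-style alignment loss (not the paper's exact form).

    visual:   (N, D) visual feature vectors from the image encoder.
    semantic: (N, D) semantic feature vectors from the text encoder,
              where row i is the positive (matched) pair of visual row i.
    """
    # L2-normalize so dot products become cosine similarities.
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    s = semantic / np.linalg.norm(semantic, axis=1, keepdims=True)
    logits = (v @ s.T) / temperature  # (N, N) similarity matrix

    # Softmax cross-entropy with the diagonal (matched pairs) as targets.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

In a plug-in setting, this term would be weighted and added to the base FSL method's classification loss, pulling each visual embedding toward its class's text embedding while pushing it away from other classes' embeddings.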