Paper Title
VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning
Paper Authors
Paper Abstract
Human-annotated attributes serve as powerful semantic embeddings in zero-shot learning. However, their annotation process is labor-intensive and needs expert supervision. Current unsupervised semantic embeddings, i.e., word embeddings, enable knowledge transfer between classes. However, word embeddings do not always reflect visual similarities and result in inferior zero-shot performance. We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning, without requiring any human annotation. Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity, and further imposes their class discrimination and semantic relatedness. To associate these clusters with previously unseen classes, we use external knowledge, e.g., word embeddings, and propose a novel class relation discovery module. Through quantitative and qualitative evaluation, we demonstrate that our model discovers semantic embeddings that model the visual properties of both seen and unseen classes. Furthermore, we demonstrate on three benchmarks that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
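The abstract does not specify how the class relation discovery module works internally. The following is a minimal NumPy sketch of the general idea it describes: relating each unseen class to its most similar seen classes in word-embedding space, then transferring the seen classes' cluster-based (visually grounded) embeddings to the unseen class. The function name, the top-k cosine-similarity weighting, and the input shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def transfer_cluster_embeddings(seen_word_emb, unseen_word_emb, seen_cluster_emb, k=5):
    """Hypothetical sketch of class relation discovery.

    seen_word_emb:    (S, D) word embeddings of seen classes
    unseen_word_emb:  (U, D) word embeddings of unseen classes
    seen_cluster_emb: (S, C) per-class scores over discovered visual clusters
    Returns (U, C) predicted cluster-based embeddings for unseen classes.
    """
    # Cosine similarity between unseen and seen classes in word-embedding space.
    sw = seen_word_emb / np.linalg.norm(seen_word_emb, axis=1, keepdims=True)
    uw = unseen_word_emb / np.linalg.norm(unseen_word_emb, axis=1, keepdims=True)
    sim = uw @ sw.T                                    # (U, S)

    # Keep only the k most related seen classes per unseen class.
    topk = np.argsort(-sim, axis=1)[:, :k]
    weights = np.zeros_like(sim)
    rows = np.arange(sim.shape[0])[:, None]
    weights[rows, topk] = sim[rows, topk]
    weights /= weights.sum(axis=1, keepdims=True) + 1e-8

    # Weighted combination of the seen classes' visually grounded embeddings.
    return weights @ seen_cluster_emb                  # (U, C)
```

The unseen-class embeddings produced this way can then be plugged into any ZSL model in place of plain word embeddings.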