Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem

Paper Title

Visual Perturbation-aware Collaborative Learning for Overcoming the Language Prior Problem

Paper Authors

Han, Yudong; Nie, Liqiang; Yin, Jianhua; Wu, Jianlong; Yan, Yan

Abstract

Several recent studies have pointed out that existing Visual Question Answering (VQA) models suffer heavily from the language prior problem: they capture superficial statistical correlations between the question type and the answer while ignoring the image content. Numerous efforts have been dedicated to strengthening the image dependency by designing delicate models or introducing extra visual annotations. However, these methods do not sufficiently explore how visual cues explicitly affect the learned answer representation, which is vital for alleviating language reliance. Moreover, they generally emphasize class-level discrimination of the learned answer representation, which overlooks the more fine-grained instance-level patterns and calls for further optimization. In this paper, we propose a novel collaborative learning scheme from the viewpoint of visual perturbation calibration, which can better investigate fine-grained visual effects and mitigate the language prior problem by learning instance-level characteristics. Specifically, we devise a visual controller to construct two sorts of curated images with different perturbation extents, based on which the collaborative learning of intra-instance invariance and inter-instance discrimination is implemented by two well-designed discriminators. In addition, we implement an information bottleneck modulator on the latent space for further bias alleviation and representation calibration. We apply our visual perturbation-aware framework to three orthodox baselines, and the experimental results on two diagnostic VQA-CP benchmark datasets clearly demonstrate its effectiveness. We also justify its robustness on the balanced VQA benchmark.
