Paper Title

Graph Density-Aware Losses for Novel Compositions in Scene Graph Generation

Authors

Knyazev, Boris, de Vries, Harm, Cangea, Cătălina, Taylor, Graham W., Courville, Aaron, Belilovsky, Eugene

Abstract

Scene graph generation (SGG) aims to predict graph-structured descriptions of input images, in the form of objects and relationships between them. This task is becoming increasingly useful for progress at the interface of vision and language. Here, it is important - yet challenging - to perform well on novel (zero-shot) or rare (few-shot) compositions of objects and relationships. In this paper, we identify two key issues that limit such generalization. Firstly, we show that the standard loss used in this task is unintentionally a function of scene graph density. This leads to the neglect of individual edges in large sparse graphs during training, even though these contain diverse few-shot examples that are important for generalization. Secondly, the frequency of relationships can create a strong bias in this task, such that a blind model predicting the most frequent relationship achieves good performance. Consequently, some state-of-the-art models exploit this bias to improve results. We show that such models can suffer the most in their ability to generalize to rare compositions, evaluating two different models on the Visual Genome dataset and its more recent, improved version, GQA. To address these issues, we introduce a density-normalized edge loss, which provides more than a two-fold improvement in certain generalization metrics. Compared to other works in this direction, our enhancements require only a few lines of code and no added computational cost. We also highlight the difficulty of accurately evaluating models using existing metrics, especially on zero/few shots, and introduce a novel weighted metric.
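The abstract notes that the standard edge loss is unintentionally a function of graph density: when per-edge losses are pooled across a batch, large dense graphs contribute many more terms than small sparse ones, so edges in sparse graphs are effectively down-weighted. A minimal sketch of the idea, in plain Python, is shown below. This is an illustration of density normalization in general, not the paper's actual implementation; the function name and the assumption that per-edge losses are precomputed are both hypothetical.

```python
def batch_edge_loss(per_graph_edge_losses, normalize_by_density=False):
    """Aggregate per-edge losses over a batch of scene graphs.

    per_graph_edge_losses: list of lists; per_graph_edge_losses[g] holds the
    per-edge loss values (e.g. cross-entropy terms) for graph g.
    """
    if normalize_by_density:
        # Weight each graph's edges by 1/|E_g|, so every graph contributes
        # equally regardless of how many edges (how dense) it is.
        per_graph_means = [sum(edges) / len(edges)
                           for edges in per_graph_edge_losses]
        return sum(per_graph_means) / len(per_graph_means)
    # Standard pooling: average over ALL edges in the batch. Dense graphs
    # dominate the sum, and sparse graphs' edges are nearly ignored.
    all_edges = [e for edges in per_graph_edge_losses for e in edges]
    return sum(all_edges) / len(all_edges)


# Toy example: one dense graph (100 well-fit edges) and one sparse graph
# (2 poorly-fit edges, e.g. a rare composition).
dense, sparse = [0.1] * 100, [1.0] * 2
standard = batch_edge_loss([dense, sparse])                           # ~0.118
normalized = batch_edge_loss([dense, sparse], normalize_by_density=True)  # 0.55
```

Under the standard loss, the sparse graph's two high-loss edges barely move the batch average; the normalized variant gives that graph an equal vote, which is the mechanism the abstract credits for the improved zero/few-shot generalization.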
