Paper Title
Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases
Paper Authors
Paper Abstract
Recent progress in 3D scene understanding has explored visual grounding (3DVG), which localizes a target object through a language description. However, existing methods only consider the dependency between the entire sentence and the target object, ignoring fine-grained relationships between contextual phrases and non-target objects. In this paper, we extend 3DVG to a more fine-grained and interpretable task, called 3D Phrase Aware Grounding (3DPAG). The 3DPAG task aims to localize the target object in a 3D scene by explicitly identifying all phrase-related objects and then reasoning over the contextual phrases. To tackle this problem, we manually labeled about 227K phrase-level annotations, using a self-developed platform, from 88K sentences of the widely used 3DVG datasets, i.e., Nr3D, Sr3D and ScanRefer. By leveraging our dataset, we can extend previous 3DVG methods to the fine-grained phrase-aware scenario. This is achieved through the proposed novel phrase-object alignment optimization and phrase-specific pre-training, which also boost conventional 3DVG performance. Extensive results confirm significant improvements: the previous state-of-the-art method achieves 3.9%, 3.5% and 4.6% overall accuracy gains on Nr3D, Sr3D and ScanRefer, respectively.
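The abstract names a phrase-object alignment optimization but does not specify its form. Below is a minimal sketch assuming a contrastive (InfoNCE-style) formulation, which is a common way to align phrase embeddings with 3D object proposals; the function name, tensor shapes, and temperature value are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a phrase-object alignment loss, assuming a contrastive
# formulation. All names and shapes here are hypothetical, not taken from
# the paper's actual method.
import torch
import torch.nn.functional as F

def phrase_object_alignment_loss(phrase_feats: torch.Tensor,
                                 object_feats: torch.Tensor,
                                 gt_object_ids: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """
    phrase_feats:  (num_phrases, d)  language features, one per annotated phrase
    object_feats:  (num_objects, d)  features of 3D object proposals in the scene
    gt_object_ids: (num_phrases,)    index of the object each phrase refers to
    """
    # Cosine similarity between every phrase and every candidate object.
    p = F.normalize(phrase_feats, dim=-1)
    o = F.normalize(object_feats, dim=-1)
    logits = p @ o.t() / temperature  # (num_phrases, num_objects)

    # Each phrase is pulled toward its annotated object and pushed away
    # from every other proposal in the scene.
    return F.cross_entropy(logits, gt_object_ids)

# Hypothetical usage: 5 annotated phrases, 20 object proposals, 256-d features.
loss = phrase_object_alignment_loss(torch.randn(5, 256),
                                    torch.randn(20, 256),
                                    torch.randint(0, 20, (5,)))
```

Under this reading, the phrase-level annotations supervise every contextual phrase (not only the target-referring one), which is what would let a 3DVG model expose its intermediate reasoning over non-target objects.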