Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

Paper title

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

Authors

Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

Abstract

Localizing objects in 3D scenes based on natural language requires understanding and reasoning about spatial relations. In particular, it is often crucial to distinguish similar objects referred to by the text, such as "the leftmost chair" and "a chair next to the window". In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations. To this end, we design a spatial self-attention layer that accounts for relative distances and orientations between objects in input 3D point clouds. Training such a layer with visual and language inputs makes it possible to disambiguate spatial relations and to localize objects referred to by the text. To facilitate the cross-modal learning of relations, we further propose a teacher-student approach in which the teacher model is first trained using ground-truth object labels and then helps to train a student model using point cloud inputs. We perform ablation studies showing the advantages of our approach. We also show that our model significantly outperforms the state of the art on the challenging Nr3D, Sr3D and ScanRefer 3D object grounding datasets.
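
The abstract describes a self-attention layer that accounts for relative distances and orientations between objects. The following is a minimal NumPy sketch of that general idea, not the authors' implementation: it assumes each object is summarized by a feature vector and a 3D center, encodes each object pair by its Euclidean distance and the sine and cosine of its horizontal angle, and adds a learned linear projection of those features as a bias to the usual attention logits. The names `pairwise_spatial_features`, `spatial_self_attention`, `w_spatial` and `w_lang`, as well as the choice of features and of deriving the spatial weights from a text embedding, are all hypothetical choices made for illustration.

```python
import numpy as np


def pairwise_spatial_features(centers):
    """Return (N, N, 3) pairwise features for object centers of shape (N, 3).

    Feature 0: Euclidean distance between centers i and j.
    Features 1-2: sin and cos of the horizontal angle from i to j.
    (Illustrative feature choice, not taken from the paper.)
    """
    diff = centers[None, :, :] - centers[:, None, :]  # diff[i, j] = c_j - c_i
    dist = np.linalg.norm(diff, axis=-1)
    theta = np.arctan2(diff[..., 1], diff[..., 0])
    return np.stack([dist, np.sin(theta), np.cos(theta)], axis=-1)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_self_attention(obj_feats, centers, w_q, w_k, w_v, w_spatial):
    """Single-head self-attention with an additive spatial bias.

    obj_feats: (N, D) object features; centers: (N, 3) object centers.
    w_q, w_k, w_v: (D, D) projections; w_spatial: (3,) weights that turn
    the pairwise spatial features into one bias value per object pair.
    """
    q, k, v = obj_feats @ w_q, obj_feats @ w_k, obj_feats @ w_v
    content_logits = q @ k.T / np.sqrt(q.shape[-1])
    spatial_bias = pairwise_spatial_features(centers) @ w_spatial  # (N, N)
    weights = softmax(content_logits + spatial_bias, axis=-1)
    return weights @ v, weights


# Usage with random data. Assumption: the spatial weights are conditioned
# on language by projecting a sentence embedding with a hypothetical w_lang.
rng = np.random.default_rng(0)
n_objects, dim, text_dim = 4, 8, 16
centers = rng.normal(size=(n_objects, 3))
obj_feats = rng.normal(size=(n_objects, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
text_emb = rng.normal(size=(text_dim,))
w_lang = rng.normal(size=(text_dim, 3))
w_spatial = text_emb @ w_lang  # (3,) language-dependent spatial weights

out, attn = spatial_self_attention(obj_feats, centers, w_q, w_k, w_v, w_spatial)
```

Making `w_spatial` depend on the text embedding is one simple way to let a phrase such as "next to" or "leftmost" change how strongly distance or direction influences attention; how the paper actually fuses language with the spatial terms is not specified in the abstract.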
