Paper Title
OpenScene: 3D Scene Understanding with Open Vocabularies
Paper Authors
Paper Abstract
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries. For example, to perform SOTA zero-shot 3D semantic segmentation it first infers CLIP features for every 3D point and later classifies them based on similarities to embeddings of arbitrary class labels. More interestingly, it enables a suite of open-vocabulary scene understanding applications that have never been done before. For example, it allows a user to enter an arbitrary text query and then see a heat map indicating which parts of a scene match. Our approach is effective at identifying objects, materials, affordances, activities, and room types in complex 3D scenes, all using a single model trained without any labeled 3D data.
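To make the open-vocabulary querying step concrete, below is a minimal sketch (not the authors' released code) of how per-point features that live in CLIP space could be classified against arbitrary class labels and scored against a free-form text query. The tensor `point_feats`, the prompt template, and the ViT-B/32 backbone are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of open-vocabulary 3D point classification and querying,
# assuming `point_feats` is an (N, 512) tensor of per-point features already
# co-embedded in CLIP space by a hypothetical OpenScene-style 3D model.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def embed_text(prompts):
    """Encode text prompts into L2-normalized CLIP embeddings."""
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens).float()
    return feats / feats.norm(dim=-1, keepdim=True)

def classify_points(point_feats, class_names):
    """Assign each 3D point the class whose text embedding is most similar."""
    point_feats = point_feats / point_feats.norm(dim=-1, keepdim=True)
    text_feats = embed_text([f"a {c} in a scene" for c in class_names])  # illustrative prompt template
    sims = point_feats @ text_feats.T   # (N, num_classes) cosine similarities
    return sims.argmax(dim=-1)          # per-point class indices

def query_heatmap(point_feats, query):
    """Return a per-point relevance score for an arbitrary text query."""
    point_feats = point_feats / point_feats.norm(dim=-1, keepdim=True)
    q = embed_text([query])             # (1, 512)
    return (point_feats @ q.T).squeeze(-1)  # (N,) similarity heat map
```

Because both classification and querying reduce to dot products against text embeddings, the same per-point features support any label set or query chosen at inference time, which is what makes the approach task-agnostic.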