Paper Title
OpenScene: 3D Scene Understanding with Open Vocabularies
Paper Authors
Paper Abstract
Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries. For example, to perform SOTA zero-shot 3D semantic segmentation it first infers CLIP features for every 3D point and later classifies them based on similarities to embeddings of arbitrary class labels. More interestingly, it enables a suite of open-vocabulary scene understanding applications that have never been done before. For example, it allows a user to enter an arbitrary text query and then see a heat map indicating which parts of a scene match. Our approach is effective at identifying objects, materials, affordances, activities, and room types in complex 3D scenes, all using a single model trained without any labeled 3D data.
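To make the open-vocabulary querying step concrete, below is a minimal sketch (not the authors' released code) of how per-point features that live in CLIP space could be classified against arbitrary class labels and scored against a free-form text query. The tensor `point_feats`, the prompt template, and the ViT-B/32 backbone are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of open-vocabulary 3D point classification and querying,
# assuming `point_feats` is an (N, 512) tensor of per-point features already
# co-embedded in CLIP space by a hypothetical OpenScene-style 3D model.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def embed_text(prompts):
    """Encode text prompts into L2-normalized CLIP embeddings."""
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens).float()
    return feats / feats.norm(dim=-1, keepdim=True)

def classify_points(point_feats, class_names):
    """Assign each 3D point the class whose text embedding is most similar."""
    point_feats = point_feats / point_feats.norm(dim=-1, keepdim=True)
    text_feats = embed_text([f"a {c} in a scene" for c in class_names])  # illustrative prompt template
    sims = point_feats @ text_feats.T   # (N, num_classes) cosine similarities
    return sims.argmax(dim=-1)          # per-point class indices

def query_heatmap(point_feats, query):
    """Return a per-point relevance score for an arbitrary text query."""
    point_feats = point_feats / point_feats.norm(dim=-1, keepdim=True)
    q = embed_text([query])             # (1, 512)
    return (point_feats @ q.T).squeeze(-1)  # (N,) similarity heat map
```

Because both classification and querying reduce to dot products against text embeddings, the same per-point features support any label set or query chosen at inference time, which is what makes the approach task-agnostic.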