Paper Title

Towards Deeper and Better Multi-view Feature Fusion for 3D Semantic Segmentation

Authors

Chaolong Yang, Yuyao Yan, Weiguang Zhao, Jianan Ye, Xi Yang, Amir Hussain, Kaizhu Huang

Abstract

3D point clouds are rich in geometric structure, while 2D images contain important, continuous texture information. Combining 2D information to achieve better 3D semantic segmentation has become mainstream in 3D scene understanding. Despite this success, it remains elusive how to fuse and process cross-dimensional features from these two distinct spaces. Existing state-of-the-art methods usually exploit bidirectional projection to align cross-dimensional features and tackle both 2D and 3D semantic segmentation tasks. However, to enable bidirectional mapping, such frameworks often require a symmetrical 2D-3D network structure, limiting the network's flexibility. Meanwhile, the dual-task setting can easily distract the network and lead to over-fitting on the 3D segmentation task. Constrained by this inflexibility, the fused features can only pass through a decoder network, whose insufficient depth hurts model performance. To alleviate these drawbacks, we argue in this paper that, despite its simplicity, projecting multi-view 2D deep semantic features unidirectionally into 3D space and aligning them with 3D deep semantic features can lead to better feature fusion. On the one hand, unidirectional projection forces our model to focus more on the core task, i.e., 3D segmentation; on the other hand, relaxing bidirectional projection to unidirectional projection enables deeper cross-domain semantic alignment and the flexibility to fuse better, more complex features from very different spaces. Among joint 2D-3D approaches, our proposed method achieves superior performance on the ScanNetv2 benchmark for 3D semantic segmentation.
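To make the core idea concrete, below is a minimal NumPy sketch of unidirectional multi-view fusion: 3D points are projected into each view with a pinhole camera model, the per-pixel 2D features at the projected locations are averaged over views, and the result is concatenated with the per-point 3D features. All function names, the nearest-neighbor sampling, and the view-averaging rule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def project_points(points, K, T_world_to_cam):
    """Project Nx3 world points into pixel coordinates (hypothetical helper).

    K: 3x3 intrinsics, T_world_to_cam: 4x4 extrinsics."""
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (T_world_to_cam @ pts_h.T).T[:, :3]          # world -> camera frame
    uvw = (K @ cam.T).T                                # camera -> image plane
    z = uvw[:, 2]
    uv = uvw[:, :2] / z[:, None]                       # perspective divide
    return uv, z

def fuse_multiview(points, feats_3d, views):
    """Unidirectional fusion sketch: lift multi-view 2D features into 3D.

    points:   (N, 3) world coordinates
    feats_3d: (N, C3) per-point deep features
    views:    list of (feature_map (H, W, C2), K, T_world_to_cam)
    Returns (N, C3 + C2) fused features."""
    n = len(points)
    c2 = views[0][0].shape[-1]
    acc = np.zeros((n, c2))
    cnt = np.zeros((n, 1))
    for fmap, K, T in views:
        h, w, _ = fmap.shape
        uv, z = project_points(points, K, T)
        u = np.round(uv[:, 0]).astype(int)             # nearest-pixel sampling
        v = np.round(uv[:, 1]).astype(int)
        valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        acc[valid] += fmap[v[valid], u[valid]]         # gather 2D features
        cnt[valid] += 1
    feats_2d = acc / np.maximum(cnt, 1)                # average over views seen
    return np.concatenate([feats_3d, feats_2d], axis=1)
```

Note how the mapping is strictly one-way: 2D features flow into 3D space, so the 2D branch needs no decoder symmetric to the 3D one, and the fused (C3 + C2)-dimensional features can then be fed through an arbitrarily deep 3D segmentation head.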
