Paper Title

Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders

Paper Authors

Renrui Zhang, Liuhui Wang, Yu Qiao, Peng Gao, Hongsheng Li

Paper Abstract

Pre-training on numerous image data has become the de-facto approach for robust 2D representations. In contrast, due to expensive data acquisition and annotation, the paucity of large-scale 3D datasets severely hinders the learning of high-quality 3D features. In this paper, we propose an alternative to obtain superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE. By self-supervised pre-training, we leverage the well-learned 2D knowledge to guide 3D masked autoencoding, which reconstructs the masked point tokens with an encoder-decoder architecture. Specifically, we first utilize off-the-shelf 2D models to extract the multi-view visual features of the input point cloud, and then conduct two types of image-to-point learning schemes on top. For one, we introduce a 2D-guided masking strategy that keeps semantically important point tokens visible to the encoder. Compared to random masking, the network can better concentrate on significant 3D structures and recover the masked tokens from key spatial cues. For another, we enforce these visible tokens to reconstruct the corresponding multi-view 2D features after the decoder. This enables the network to effectively inherit high-level 2D semantics learned from rich image data for discriminative 3D modeling. Aided by our image-to-point pre-training, the frozen I2P-MAE, without any fine-tuning, achieves 93.4% linear SVM accuracy on ModelNet40, competitive with the fully trained results of existing methods. By further fine-tuning on ScanObjectNN's hardest split, I2P-MAE attains the state-of-the-art 90.11% accuracy, +3.68% over the second-best, demonstrating superior transferable capacity. Code will be available at https://github.com/ZrrSkywalker/I2P-MAE.
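The abstract describes two image-to-point schemes: a 2D-guided masking strategy and a 2D-semantic reconstruction target. Below is a minimal, illustrative PyTorch-style sketch of how such components might be wired together; all names here (`saliency_guided_mask`, `point_saliency`, `feat_2d_mv`, the sampling rule, and the cosine-distance loss) are assumptions for illustration, not the released I2P-MAE implementation hosted at the repository above.

```python
import torch
import torch.nn.functional as F

def saliency_guided_mask(point_saliency, mask_ratio=0.6):
    """2D-guided masking (sketch): sample visible point tokens with
    probability proportional to a saliency score aggregated from projected
    multi-view 2D features, instead of masking tokens uniformly at random.

    point_saliency: (B, G) saliency per point token
    returns: bool mask of shape (B, G); True = masked (hidden from encoder)
    """
    B, G = point_saliency.shape
    num_visible = int(G * (1.0 - mask_ratio))

    probs = torch.softmax(point_saliency, dim=-1)           # (B, G)
    visible_idx = torch.multinomial(probs, num_visible)     # (B, num_visible)

    mask = torch.ones(B, G, dtype=torch.bool, device=point_saliency.device)
    batch_idx = torch.arange(B, device=point_saliency.device).unsqueeze(1)
    mask[batch_idx, visible_idx] = False                    # visible tokens stay unmasked
    return mask

def twod_semantic_loss(decoded_visible, feat_2d_mv):
    """2D-semantic reconstruction (sketch): push the decoded features of the
    visible tokens toward their corresponding multi-view 2D features, so
    high-level image semantics are distilled into the 3D network.

    decoded_visible: (B, V, C) decoder output at visible token positions
    feat_2d_mv:      (B, V, C) aggregated multi-view 2D features for those tokens
    """
    decoded = F.normalize(decoded_visible, dim=-1)
    target = F.normalize(feat_2d_mv, dim=-1)
    return (1.0 - (decoded * target).sum(dim=-1)).mean()    # mean cosine distance
```

In the full pipeline these two pieces would sit alongside the standard MAE objective of reconstructing the coordinates of the masked point tokens; the sketch only isolates the image-to-point components named in the abstract.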
