Paper Title


Self-supervised Feature Learning by Cross-modality and Cross-view Correspondences

Paper Authors

Longlong Jing, Yucheng Chen, Ling Zhang, Mingyi He, Yingli Tian

Paper Abstract


The success of supervised learning requires large-scale ground-truth labels, which are very expensive, time-consuming, or may need special skills to annotate. To address this issue, many self- or un-supervised methods have been developed. Unlike most existing self-supervised methods, which learn only 2D image features or only 3D point cloud features, this paper presents a novel and effective self-supervised learning approach that jointly learns both 2D image features and 3D point cloud features by exploiting cross-modality and cross-view correspondences, without using any human-annotated labels. Specifically, 2D image features of rendered images from different views are extracted by a 2D convolutional neural network, and 3D point cloud features are extracted by a graph convolutional neural network. The two types of features are fed into a two-layer fully connected neural network to estimate the cross-modality correspondence. The three networks are jointly trained (i.e., cross-modality) by verifying whether two sampled data of different modalities belong to the same object; meanwhile, the 2D convolutional neural network is additionally optimized by minimizing the intra-object distance while maximizing the inter-object distance of rendered images from different views (i.e., cross-view). The effectiveness of the learned 2D and 3D features is evaluated by transferring them to five different tasks: multi-view 2D shape recognition, 3D shape recognition, multi-view 2D shape retrieval, 3D shape retrieval, and 3D part segmentation. Extensive evaluations on all five tasks across different datasets demonstrate the strong generalization and effectiveness of the 2D and 3D features learned by the proposed self-supervised method.
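The two training objectives described in the abstract can be sketched as below. This is a minimal, illustrative sketch, not the paper's implementation: the function names are invented here, the cross-modality term is assumed to be a standard binary cross-entropy on the correspondence prediction, and the cross-view term is assumed to take a triplet-style hinge form (the abstract specifies only that intra-object distance is minimized while inter-object distance is maximized).

```python
import math

def correspondence_loss(p_same, is_same_object):
    """Cross-modality term (assumed binary cross-entropy): p_same is the
    predicted probability that a rendered 2D image and a 3D point cloud
    come from the same object; is_same_object is the ground-truth pairing."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p_same, eps), 1.0 - eps)
    return -math.log(p) if is_same_object else -math.log(1.0 - p)

def cross_view_loss(d_intra, d_inter, margin=1.0):
    """Cross-view term (assumed triplet-style hinge): push the distance
    between views of the same object (d_intra) below the distance between
    views of different objects (d_inter) by at least `margin`."""
    return max(0.0, d_intra - d_inter + margin)
```

In this formulation, a correct, confident correspondence prediction and a well-separated pair of view embeddings both drive their respective losses toward zero, matching the joint training described in the abstract.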
