Paper Title


THOR-Net: End-to-end Graformer-based Realistic Two Hands and Object Reconstruction with Self-supervision

Authors

Ahmed Tawfik Aboukhadra, Jameel Malik, Ahmed Elhayek, Nadia Robertini, Didier Stricker

Abstract


Realistic reconstruction of two hands interacting with objects is a new and challenging problem that is essential for building personalized Virtual and Augmented Reality environments. Graph Convolutional Networks (GCNs) allow for the preservation of the topologies of hand poses and shapes by modeling them as a graph. In this work, we propose THOR-Net, which combines the power of GCNs, Transformers, and self-supervision to realistically reconstruct two hands and an object from a single RGB image. Our network comprises two stages, namely the feature extraction stage and the reconstruction stage. In the feature extraction stage, a Keypoint RCNN is used to extract 2D poses, feature maps, heatmaps, and bounding boxes from a monocular RGB image. Thereafter, this 2D information is modeled as two graphs and passed to the two branches of the reconstruction stage. The shape reconstruction branch estimates meshes of two hands and an object using our novel coarse-to-fine GraFormer shape network. The 3D poses of the hands and objects are reconstructed by the other branch using a GraFormer network. Finally, a self-supervised photometric loss is used to directly regress the realistic texture of each vertex in the hands' meshes. Our approach achieves state-of-the-art results in hand shape estimation on the HO-3D dataset (10.0mm), exceeding ArtiBoost (10.8mm). It also surpasses other methods in hand pose estimation on the challenging Two Hands and Object (H2O) dataset by 5mm on the left-hand pose and 1mm on the right-hand pose.
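The self-supervised photometric loss mentioned in the abstract compares the color predicted for each mesh vertex against the color observed in the input image at that vertex's projected 2D location. A minimal sketch of this idea in plain Python follows; the dict-based image lookup is a simplifying assumption standing in for real bilinear sampling from an image tensor, and the function name and signature are illustrative, not the paper's actual implementation.

```python
def photometric_loss(vertex_colors, vertex_uvs, image):
    """Mean squared error between predicted per-vertex RGB colors and the
    image colors sampled at the vertices' projected 2D locations.

    vertex_colors: list of (r, g, b) tuples predicted for each mesh vertex
    vertex_uvs:    list of (u, v) pixel coordinates where each vertex projects
    image:         dict mapping (u, v) -> observed (r, g, b) color
    """
    total = 0.0
    for predicted, uv in zip(vertex_colors, vertex_uvs):
        observed = image[uv]  # sample the image at the projected vertex
        total += sum((p - o) ** 2 for p, o in zip(predicted, observed))
    # average over all color channels of all vertices
    return total / (3 * len(vertex_colors))
```

Because the supervision signal comes entirely from the input image itself, no ground-truth texture annotations are required, which is what makes the texture regression self-supervised.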
