Paper Title

MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer

Paper Authors

Chaoqiang Zhao, Youmin Zhang, Matteo Poggi, Fabio Tosi, Xianda Guo, Zheng Zhu, Guan Huang, Yang Tang, Stefano Mattoccia

Paper Abstract

Self-supervised monocular depth estimation is an attractive solution that does not require hard-to-source depth labels for training. Convolutional neural networks (CNNs) have recently achieved great success in this task. However, their limited receptive field constrains existing network architectures to reason only locally, dampening the effectiveness of the self-supervised paradigm. In the light of the recent successes achieved by Vision Transformers (ViTs), we propose MonoViT, a brand-new framework combining the global reasoning enabled by ViT models with the flexibility of self-supervised monocular depth estimation. By combining plain convolutions with Transformer blocks, our model can reason locally and globally, yielding depth prediction at a higher level of detail and accuracy, allowing MonoViT to achieve state-of-the-art performance on the established KITTI dataset. Moreover, MonoViT proves its superior generalization capacities on other datasets such as Make3D and DrivingStereo.
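
The abstract describes pairing plain convolutions (local reasoning) with Transformer blocks (global reasoning) inside the depth network. The sketch below is a rough illustration of that idea as a single hypothetical PyTorch encoder stage; it is not the authors' released MonoViT architecture, and all module and parameter names here (LocalGlobalBlock, num_heads, etc.) are assumptions made for this example.

# Minimal sketch, assuming a standard PyTorch setup: a 3x3 convolution supplies
# local context, and multi-head self-attention over the flattened feature map
# supplies global context. Hypothetical names; not the authors' MonoViT code.
import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.local(x)                       # local reasoning via convolution
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C) token sequence
        q = self.norm(tokens)
        attended, _ = self.attn(q, q, q)        # global reasoning via self-attention
        tokens = tokens + attended              # residual connection
        return tokens.transpose(1, 2).reshape(b, c, h, w)

if __name__ == "__main__":
    feats = torch.randn(1, 64, 24, 80)          # e.g. a downsampled KITTI-sized feature map
    print(LocalGlobalBlock(64)(feats).shape)    # torch.Size([1, 64, 24, 80])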
