Paper Title

Can Language Understand Depth?

Paper Authors

Renrui Zhang, Ziyao Zeng, Ziyu Guo, Yafeng Li

Paper Abstract

Besides image classification, Contrastive Language-Image Pre-training (CLIP) has achieved extraordinary success on a wide range of vision tasks, including object-level and 3D space understanding. However, it remains challenging to transfer the semantic knowledge learned by CLIP to more intricate tasks with quantified targets, such as depth estimation with geometric information. In this paper, we propose to apply CLIP to zero-shot monocular depth estimation, named DepthCLIP. We find that the patches of the input image can respond to certain semantic distance tokens and then be projected to quantified depth bins for coarse estimation. Without any training, our DepthCLIP surpasses existing unsupervised methods and even approaches early fully-supervised networks. To the best of our knowledge, we are the first to conduct zero-shot adaptation from semantic language knowledge to quantified downstream tasks and to perform zero-shot monocular depth estimation. We hope our work can shed light on future research. The code is available at https://github.com/Adonis-galaxy/DepthCLIP.
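
The abstract describes the core mechanism: image regions respond to semantic distance tokens ("close", "far", ...), and those responses are projected onto quantified depth bins. The sketch below illustrates that projection for a single global CLIP feature using the openai/CLIP package. It is a minimal sketch, not the authors' released implementation: the prompt wording, the seven distance classes, the bin values, and the file name `example.jpg` are illustrative assumptions, and a dense per-patch depth map would additionally require reading out CLIP's patch features before pooling.

```python
# Minimal sketch of the DepthCLIP idea: match features against semantic
# distance prompts, then take a softmax-weighted sum of depth bins.
# Distance words, bin values, and the image path are illustrative assumptions.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Semantic distance tokens and the quantified depth bins (meters) they map to.
distance_words = ["giant", "extremely close", "close", "not in distance",
                  "a little remote", "far", "unseen"]
depth_bins = torch.tensor([1.00, 1.50, 2.00, 2.25, 2.50, 2.75, 3.00],
                          device=device)

prompts = [f"This object is {w}." for w in distance_words]
text_tokens = clip.tokenize(prompts).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    text_feat = model.encode_text(text_tokens)    # (7, 512)
    image_feat = model.encode_image(image)        # (1, 512) global feature;
    # DepthCLIP works on per-patch features (pooling removed) for a dense map.

    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    # Similarity to each distance token -> softmax -> weighted sum of bins.
    logits = 100.0 * image_feat @ text_feat.T     # (1, 7)
    weights = logits.softmax(dim=-1)
    depth = (weights * depth_bins).sum(dim=-1)    # coarse depth estimate

print(f"Estimated (coarse) depth: {depth.item():.2f} m")
```

Applying the same weighting to every patch token instead of the pooled feature yields a coarse depth map, which is the zero-shot estimate the abstract refers to.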
