Paper Title


Deep Laparoscopic Stereo Matching with Transformers

Paper Authors

Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Tom Drummond, Zhiyong Wang, Zongyuan Ge

Abstract


The self-attention mechanism, successfully employed in the transformer architecture, has shown promise in many computer vision tasks, including image recognition and object detection. Despite this surge, the use of transformers for the problem of stereo matching remains relatively unexplored. In this paper, we comprehensively investigate the use of transformers for stereo matching, especially for laparoscopic videos, and propose a new hybrid deep stereo matching framework (HybridStereoNet) that combines the best of CNNs and transformers in a unified design. Specifically, we investigate several ways to introduce transformers into volumetric stereo matching pipelines by analyzing the loss landscape of the designs and their in-domain/cross-domain accuracy. Our analysis suggests that employing transformers for feature representation learning, while using CNNs for cost aggregation, leads to faster convergence, higher accuracy, and better generalization than the other options. Our extensive experiments on the SceneFlow, SCARED2019, and dVPN datasets demonstrate the superior performance of HybridStereoNet.
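To make the "volumetric stereo matching pipeline" mentioned above concrete, the sketch below shows its core step in NumPy: building a cost volume by comparing left-image features against horizontally shifted right-image features, then picking the lowest-cost disparity per pixel. This is a generic illustration of the technique, not the paper's HybridStereoNet code; the function names and the L1 matching cost are assumptions for the example (the paper learns features with transformers and aggregates cost with CNNs instead of this winner-take-all step).

```python
import numpy as np

def build_cost_volume(left_feat, right_feat, max_disp):
    """Build a (max_disp, H, W) cost volume from (C, H, W) feature maps.

    cost[d, y, x] measures how poorly left pixel (y, x) matches
    right pixel (y, x - d), here with a simple L1 feature distance.
    """
    C, H, W = left_feat.shape
    cost = np.full((max_disp, H, W), np.inf)
    for d in range(max_disp):
        # Left pixel x corresponds to right pixel x - d, so align
        # left_feat[..., d:] with right_feat[..., :W - d].
        diff = left_feat[:, :, d:] - right_feat[:, :, :W - d]
        cost[d, :, d:] = np.abs(diff).sum(axis=0)
    return cost

def winner_take_all(cost):
    """Pick the disparity with minimal cost at each pixel."""
    return np.argmin(cost, axis=0)
```

In learned volumetric pipelines, the L1 comparison is replaced by concatenated or correlated deep features, and `winner_take_all` by a 3D-CNN cost-aggregation stage followed by a soft-argmin, but the volume layout is the same.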
