Paper Title
Video-TransUNet: Temporally Blended Vision Transformer for CT VFSS Instance Segmentation
Paper Authors
Paper Abstract
We propose Video-TransUNet, a deep architecture for instance segmentation in medical CT videos, built by integrating temporal feature blending into the TransUNet deep learning framework. In particular, our approach combines strong frame representation via a ResNet CNN backbone, multi-frame feature blending via a Temporal Context Module (TCM), non-local attention via a Vision Transformer, and reconstructive capabilities for multiple targets via a UNet-based convolutional-deconvolutional architecture with multiple heads. We show that this new network design significantly outperforms other state-of-the-art systems when tested on the segmentation of bolus and pharynx/larynx in Videofluoroscopic Swallowing Study (VFSS) CT sequences. On our VFSS2022 dataset it achieves a Dice coefficient of 0.8796 and an average surface distance of 1.0379 pixels. Note that accurate tracking of the pharyngeal bolus is a particularly important application in clinical practice, since it constitutes the primary method for diagnosing swallowing impairment. Our findings suggest that the proposed model can indeed enhance the TransUNet architecture by exploiting temporal information, improving segmentation performance by a significant margin. We publish key source code, network weights, and ground-truth annotations to facilitate reproduction of our results.
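To make the described pipeline concrete, here is a minimal PyTorch sketch of how the four components could fit together. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the TCM internals, the Transformer depth, or the decoder layout, so the 3D convolution standing in for the TCM, the class name `VideoTransUNetSketch`, the choice of the clip's center frame as the keyframe, and all layer sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torchvision

class VideoTransUNetSketch(nn.Module):
    """Illustrative stand-in for Video-TransUNet; all internals are assumptions."""

    def __init__(self, num_targets=2, dim=2048):
        super().__init__()
        # 1) Frame representation: ResNet-50 trunk (avgpool/fc removed).
        resnet = torchvision.models.resnet50(weights=None)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])
        # 2) Temporal blending: a 3x1x1 conv over the time axis stands in
        #    for the paper's Temporal Context Module (design not given here).
        self.tcm = nn.Conv3d(dim, dim, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # 3) Non-local attention: a small Transformer encoder over the
        #    keyframe's spatial tokens, in the spirit of TransUNet's ViT.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers=4)
        # 4) UNet-style deconvolutional decoder with one head per target
        #    (skip connections and full-resolution upsampling omitted).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, 256, kernel_size=2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(256, 64, kernel_size=2, stride=2), nn.ReLU(),
        )
        self.heads = nn.ModuleList([nn.Conv2d(64, 1, 1) for _ in range(num_targets)])

    def forward(self, clip):                           # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.encoder(clip.flatten(0, 1))       # (B*T, 2048, H/32, W/32)
        feats = feats.view(b, t, *feats.shape[1:])     # (B, T, 2048, h, w)
        feats = self.tcm(feats.permute(0, 2, 1, 3, 4)) # blend along the time axis
        key = feats[:, :, t // 2]                      # assumed keyframe: center frame
        tokens = key.flatten(2).transpose(1, 2)        # (B, h*w, 2048) spatial tokens
        tokens = self.vit(tokens)                      # global spatial attention
        grid = tokens.transpose(1, 2).reshape(b, -1, *key.shape[-2:])
        up = self.decoder(grid)                        # coarse upsampled features
        return [torch.sigmoid(head(up)) for head in self.heads]  # per-target masks
```

A possible invocation, segmenting the two targets named in the abstract from a short clip:

```python
model = VideoTransUNetSketch()
clip = torch.randn(1, 5, 3, 256, 256)   # batch of one 5-frame VFSS clip
bolus, pharynx_larynx = model(clip)     # two probability maps, each (1, 1, 32, 32)
```

For reference, the Dice coefficient reported above is the standard overlap measure between a predicted and a ground-truth binary mask:

```python
def dice(pred, target, eps=1e-6):
    # Dice = 2|P ∩ G| / (|P| + |G|); eps guards against empty masks.
    inter = (pred * target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)
```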