Paper Title

UNetFormer: A Unified Vision Transformer Model and Pre-Training Framework for 3D Medical Image Segmentation

Paper Authors

Ali Hatamizadeh, Ziyue Xu, Dong Yang, Wenqi Li, Holger Roth, Daguang Xu

Abstract

Vision Transformers (ViTs) have recently become popular due to their outstanding modeling capabilities, in particular for capturing long-range information, and their scalability to dataset and model sizes, which has led to state-of-the-art performance in various computer vision and medical image analysis tasks. In this work, we introduce a unified framework consisting of two architectures, dubbed UNetFormer, with a 3D Swin Transformer-based encoder and either a Convolutional Neural Network (CNN)-based or a transformer-based decoder. In the proposed model, the encoder is linked to the decoder via skip connections at five different resolutions with deep supervision. The design of the proposed architecture allows it to meet a wide range of trade-off requirements between accuracy and computational cost. In addition, we present a methodology for self-supervised pre-training of the encoder backbone via learning to predict randomly masked volumetric tokens using the contextual information of visible tokens. We pre-train our framework on a cohort of 5,050 CT images gathered from publicly available CT datasets, and present a systematic investigation of components, such as the masking ratio and patch size, that affect the representation learning capability and downstream task performance. We validate the effectiveness of our pre-training approach by fine-tuning and testing our model on the liver and liver tumor segmentation task of the Medical Segmentation Decathlon (MSD) dataset, achieving state-of-the-art performance in terms of various segmentation metrics. To demonstrate its generalizability, we train and test the model on the BraTS 21 dataset for brain tumor segmentation using MRI images, outperforming other methods in terms of Dice score. Code: https://github.com/Project-MONAI/research-contributions
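
To make the encoder-decoder wiring concrete, below is a minimal, self-contained PyTorch sketch of a U-shaped 3D network with skip connections at five resolutions and a deep-supervision head at each decoder stage. It is illustrative only: a plain convolutional encoder stands in for the paper's 3D Swin Transformer encoder, and the channel widths, class count, and input size are assumptions rather than the paper's configuration.

```python
# Sketch of a U-shaped 3D segmentation network: five encoder resolutions,
# skip connections into a CNN decoder, and deep supervision.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

def block(cin, cout):
    return nn.Sequential(nn.Conv3d(cin, cout, 3, padding=1),
                         nn.InstanceNorm3d(cout), nn.LeakyReLU())

class UShape3D(nn.Module):
    def __init__(self, in_ch=1, n_classes=3, widths=(16, 32, 64, 128, 256)):
        super().__init__()
        # five encoder stages; each stage after the first halves resolution
        self.enc = nn.ModuleList()
        c = in_ch
        for w in widths:
            self.enc.append(block(c, w))
            c = w
        self.pool = nn.MaxPool3d(2)
        # decoder: upsample, concatenate the matching skip, then convolve
        self.up = nn.ModuleList(
            nn.ConvTranspose3d(widths[i], widths[i - 1], 2, stride=2)
            for i in range(len(widths) - 1, 0, -1))
        self.dec = nn.ModuleList(
            block(2 * widths[i - 1], widths[i - 1])
            for i in range(len(widths) - 1, 0, -1))
        # deep supervision: a segmentation head at every decoder resolution
        self.heads = nn.ModuleList(
            nn.Conv3d(widths[i - 1], n_classes, 1)
            for i in range(len(widths) - 1, 0, -1))

    def forward(self, x):
        skips = []
        for i, stage in enumerate(self.enc):
            x = stage(x if i == 0 else self.pool(x))
            skips.append(x)
        outs = []
        for up, dec, head, skip in zip(self.up, self.dec, self.heads,
                                       reversed(skips[:-1])):
            x = dec(torch.cat([up(x), skip], dim=1))
            outs.append(head(x))  # auxiliary output for deep supervision
        return outs  # coarsest-to-finest predictions

# usage: four supervised outputs from 8^3 up to 64^3 for a 64^3 input
preds = UShape3D()(torch.randn(1, 1, 64, 64, 64))
print([p.shape for p in preds])
```

Swapping the convolutional decoder blocks for transformer blocks, with the same skip-connection wiring, gives the second decoder variant the abstract describes.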
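
The self-supervised objective can likewise be sketched in a few lines: volumetric tokens are randomly masked, and the encoder must reconstruct them from the context provided by the visible tokens. The patch size of 16, the 0.6 masking ratio, and the plain transformer encoder below are illustrative assumptions; the paper studies masking ratio and patch size systematically rather than fixing these values.

```python
# Sketch of masked volumetric token pre-training. A generic transformer
# encoder stands in for the 3D Swin encoder; all hyperparameters are
# illustrative assumptions.
import torch
import torch.nn as nn

class MaskedVolumePretrainer(nn.Module):
    def __init__(self, patch=16, dim=768, mask_ratio=0.6):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        voxels = patch ** 3  # voxels per volumetric token
        self.embed = nn.Linear(voxels, dim)        # token embedding
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, voxels)         # reconstruct masked voxels

    def forward(self, vol):                        # vol: (B, 1, D, H, W)
        p = self.patch
        B = vol.size(0)
        # patchify the volume into (B, N, p^3) volumetric tokens
        tokens = vol.unfold(2, p, p).unfold(3, p, p).unfold(4, p, p)
        tokens = tokens.reshape(B, -1, p ** 3)
        x = self.embed(tokens)                     # (B, N, dim)
        # randomly replace a fraction of tokens with the learned mask token
        mask = torch.rand(B, x.size(1), device=x.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token.to(x.dtype), x)
        pred = self.head(self.encoder(x))          # (B, N, p^3)
        # loss only on masked positions, predicted from visible context
        return ((pred - tokens) ** 2)[mask].mean()

# usage: one pre-training step on a random CT-like volume
model = MaskedVolumePretrainer()
loss = model(torch.randn(2, 1, 64, 64, 64))
loss.backward()
```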
