Paper Title
Multi-scale Attention U-Net (MsAUNet): A Modified U-Net Architecture for Scene Segmentation
Paper Authors
Paper Abstract
Despite the growing success of Convolutional Neural Networks (CNNs) in the task of scene segmentation in recent years, the standard models lack some important features that can result in sub-optimal segmentation outputs. The widely used encoder-decoder architecture extracts and uses several redundant and low-level features at different steps and different scales. Moreover, these networks fail to map the long-range dependencies of local features, which hinders the generation of discriminative feature maps corresponding to each semantic class in the resulting segmented image. In this paper, we propose a novel multi-scale attention network for scene segmentation that exploits the rich contextual information in an image. Unlike the original UNet architecture, we use attention gates that take the features from the encoder and the output of the pyramid pooling layer as input; the produced output is further concatenated with the up-sampled output of the previous pyramid-pooling layer and mapped to the next subsequent layer. This network can map local features to their global counterparts with improved accuracy and emphasize discriminative image regions by focusing only on relevant local features. We also propose a compound loss function that optimizes the IoU loss and fuses Dice loss and weighted cross-entropy loss with it to achieve an optimal solution at a faster convergence rate. We have evaluated our model on two standard datasets, PascalVOC2012 and ADE20K, achieving mean IoU of 79.88% and 44.88% on the two datasets respectively, and compared our results with widely known models to demonstrate the superiority of our model over them.
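The abstract describes two key components: an attention-gated fusion between encoder features and pyramid-pooling outputs, and a compound loss combining IoU, Dice, and weighted cross-entropy terms. The PyTorch sketches below are illustrative only; the module names, channel arguments, and the equal weighting of the loss terms are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Additive attention gate (in the spirit of Attention U-Net, assumed here):
    encoder features are re-weighted by a gating signal from the pyramid-pooling
    path before being concatenated with the up-sampled decoder features."""
    def __init__(self, enc_channels, gate_channels, inter_channels):
        super().__init__()
        self.theta = nn.Conv2d(enc_channels, inter_channels, kernel_size=1)
        self.phi = nn.Conv2d(gate_channels, inter_channels, kernel_size=1)
        self.psi = nn.Conv2d(inter_channels, 1, kernel_size=1)

    def forward(self, enc_feat, gate):
        # Resample the gating signal to the encoder feature resolution.
        gate = F.interpolate(gate, size=enc_feat.shape[2:], mode="bilinear",
                             align_corners=False)
        attn = torch.sigmoid(self.psi(torch.relu(self.theta(enc_feat) + self.phi(gate))))
        return enc_feat * attn  # attended encoder features


def compound_loss(logits, targets, class_weights=None, eps=1e-6):
    """Hypothetical compound loss: soft IoU loss + Dice loss + weighted cross-entropy.

    logits: (N, C, H, W) raw scores; targets: (N, H, W) integer class labels.
    """
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(targets, num_classes).permute(0, 3, 1, 2).float()

    # Soft IoU (Jaccard) loss, averaged over classes.
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = (probs + onehot - probs * onehot).sum(dim=(0, 2, 3))
    iou_loss = 1.0 - ((inter + eps) / (union + eps)).mean()

    # Dice loss, averaged over classes.
    dice = (2.0 * inter + eps) / (probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3)) + eps)
    dice_loss = 1.0 - dice.mean()

    # Weighted cross-entropy (per-class weights optional).
    ce_loss = F.cross_entropy(logits, targets, weight=class_weights)

    # Equal weighting of the three terms is an assumption for illustration.
    return iou_loss + dice_loss + ce_loss
```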