Paper Title
Defect Transformer: An Efficient Hybrid Transformer Architecture for Surface Defect Detection
Paper Authors
Paper Abstract
Surface defect detection is a crucial step in ensuring the quality of industrial products. Nowadays, convolutional neural networks (CNNs) based on the encoder-decoder architecture have achieved tremendous success in various defect detection tasks. However, due to the intrinsic locality of convolution, they commonly exhibit a limitation in explicitly modeling long-range interactions, which is critical for pixel-wise defect detection in complex cases, e.g., cluttered backgrounds and illegible pseudo-defects. Recent transformers are especially skilled at learning global image dependencies but provide limited local structural information, which is necessary for detailed defect localization. To overcome the above limitations, we propose an efficient hybrid transformer architecture, termed Defect Transformer (DefT), for surface defect detection, which incorporates the CNN and the transformer into a unified model to capture local and non-local relationships collaboratively. Specifically, in the encoder module, a convolutional stem block is first adopted to retain more detailed spatial information. Then, patch aggregation blocks are used to generate a multi-scale representation with four hierarchies, each of which is followed by a series of DefT blocks; each DefT block comprises a locally position-aware block for local position encoding, a lightweight multi-pooling self-attention module that models multi-scale global contextual relationships with good computational efficiency, and a convolutional feed-forward network for feature transformation and further location-information learning. Finally, a simple but effective decoder module is proposed to gradually recover spatial details from the skip connections in the encoder. Extensive experiments on three datasets demonstrate the superiority and efficiency of our method compared with other CNN- and transformer-based networks.
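The abstract only names the components of a DefT block, so a small sketch may help make the design concrete. Below is a minimal PyTorch sketch of how such a block could be assembled, under plausible assumptions: the multi-pooling self-attention is read as attention whose keys/values are average-pooled at several spatial scales, and both the locally position-aware block and the convolutional feed-forward network are read as depth-wise convolutions. All class names, pooling ratios, and hyperparameters (MultiPoolingSelfAttention, pool_ratios, etc.) are illustrative, not the authors' reference implementation.

```python
# Hedged sketch of a DefT-style block; component details are assumptions
# inferred from the abstract, not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPoolingSelfAttention(nn.Module):
    """Self-attention whose keys/values are pooled at several spatial scales,
    shrinking the token count (and cost) while exposing multi-scale context."""
    def __init__(self, dim, num_heads=4, pool_ratios=(2, 4, 8, 16)):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.pool_ratios = pool_ratios  # assumed downsampling factors
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) flattened feature map with N = H * W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, C // self.num_heads).transpose(1, 2)

        # Pool the feature map at each ratio and concatenate the pooled tokens.
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        pooled = [F.adaptive_avg_pool2d(feat, (max(H // r, 1), max(W // r, 1)))
                  .flatten(2).transpose(1, 2) for r in self.pool_ratios]
        kv_tokens = torch.cat(pooled, dim=1)                      # (B, M, C), M << N

        kv = self.kv(kv_tokens).reshape(B, -1, 2, self.num_heads,
                                        C // self.num_heads).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]

        attn = (q @ k.transpose(-2, -1)) * self.scale             # (B, heads, N, M)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

class ConvFFN(nn.Module):
    """Feed-forward network with a depth-wise conv between the two linear
    layers, so the MLP also picks up local positional cues."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1, groups=hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        x = self.fc1(x)
        x = x.transpose(1, 2).reshape(B, -1, H, W)
        x = F.gelu(self.dwconv(x)).flatten(2).transpose(1, 2)
        return self.fc2(x)

class DefTBlock(nn.Module):
    """Locally position-aware conv + multi-pooling attention + conv FFN."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.lpa = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local position encoding
        self.norm1 = nn.LayerNorm(dim)
        self.attn = MultiPoolingSelfAttention(dim, num_heads)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = ConvFFN(dim, dim * 4)

    def forward(self, x, H, W):
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        x = x + self.lpa(feat).flatten(2).transpose(1, 2)         # locally position-aware block
        x = x + self.attn(self.norm1(x), H, W)
        x = x + self.ffn(self.norm2(x), H, W)
        return x
```

As a quick check, DefTBlock(dim=64)(torch.randn(2, 32 * 32, 64), 32, 32) returns a (2, 1024, 64) tensor. Pooling the keys/values to roughly a third of the original token count is what keeps the attention lightweight in this sketch, in the spirit of the efficiency claim in the abstract.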