Paper Title
Softmax-free Linear Transformers
Paper Authors
Paper Abstract
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. The self-attention mechanism underpinning the strength of ViTs has quadratic complexity in both computation and memory usage. This motivates the development of methods that approximate self-attention at linear complexity. However, an in-depth analysis in this work reveals that existing methods are either theoretically flawed or empirically ineffective for visual recognition. We identify that their limitations are rooted in the inheritance of softmax-based self-attention during approximation, that is, normalizing the scaled dot-product between token feature vectors using the softmax function; preserving the softmax operation challenges any subsequent linearization efforts. With this insight, a family of Softmax-Free Transformers (SOFT) is proposed. Specifically, a Gaussian kernel function is adopted to replace the dot-product similarity, enabling the full self-attention matrix to be approximated under a low-rank matrix decomposition. For computational robustness, we estimate the Moore-Penrose inverse using an iterative Newton-Raphson method in the forward process only, while calculating its theoretical gradients only once in the backward process. To further expand applicability (e.g., to dense prediction tasks), an efficient symmetric normalization technique is introduced. Extensive experiments on ImageNet, COCO, and ADE20K show that our SOFT significantly improves the computational efficiency of existing ViT variants. With linear complexity, much longer token sequences are permitted by SOFT, resulting in a superior trade-off between accuracy and complexity. Code and models are available at https://github.com/fudan-zvg/SOFT.
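To make the key steps in the abstract concrete, the following is a minimal NumPy sketch of the general idea rather than the authors' implementation (see the linked repository for that). It assumes a single head with Q = K equal to the token matrix, selects bottleneck tokens by random sampling purely for illustration, and uses a standard Newton-Schulz iteration as the iterative Newton-Raphson estimate of the Moore-Penrose inverse; the names gaussian_kernel, newton_schulz_pinv, and soft_attention are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, scale):
    # Pairwise Gaussian similarities exp(-||a_i - b_j||^2 / (2 * scale)) between
    # rows of a (n, d) and b (m, d), replacing the softmax-normalized dot product.
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-sq / (2.0 * scale))

def newton_schulz_pinv(B, iters=20):
    # Newton-Schulz (a Newton-Raphson-type) iteration estimating the Moore-Penrose
    # inverse of B without an explicit SVD; the initialization below is a standard
    # convergent choice, not necessarily the one used by SOFT.
    V = B.T / (np.linalg.norm(B, 1) * np.linalg.norm(B, np.inf))
    I = np.eye(B.shape[0])
    for _ in range(iters):
        V = V @ (2.0 * I - B @ V)
    return V

def soft_attention(X, m=16, seed=0):
    # Sketch of softmax-free attention with a low-rank (Nystrom-style) factorization.
    # X: (n, d) token features; Q = K = X here for simplicity (an assumption).
    n, d = X.shape
    scale = np.sqrt(d)
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=m, replace=False)   # m "bottleneck" tokens (random here)
    Xb = X[idx]                                  # (m, d)
    A = gaussian_kernel(X, Xb, scale)            # (n, m)
    B = gaussian_kernel(Xb, Xb, scale)           # (m, m)
    # Full attention S is approximated as A @ pinv(B) @ A.T; applying it to X in the
    # order below avoids ever forming the n x n matrix, keeping the cost linear in n.
    return A @ (newton_schulz_pinv(B) @ (A.T @ X))

tokens = np.random.default_rng(1).normal(size=(128, 64))
out = soft_attention(tokens)
print(out.shape)  # (128, 64)
```

The parenthesized multiplication order is the point of the low-rank factorization: only n-by-m and m-by-m products appear, so memory and compute grow linearly with the token count instead of quadratically.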