Paper Title
BiViT: Extremely Compressed Binary Vision Transformer
Paper Authors
Paper Abstract
Model binarization can significantly compress model size, reduce energy consumption, and accelerate inference through efficient bit-wise operations. Although the binarization of convolutional neural networks has been extensively studied, little work has explored the binarization of vision Transformers, which underpin the recent breakthroughs in visual recognition. To this end, we propose to solve two fundamental challenges to push the horizon of Binary Vision Transformers (BiViT). First, traditional binarization methods do not take the long-tailed distribution of softmax attention into consideration, which introduces large binarization errors in the attention module. To solve this, we propose Softmax-aware Binarization, which dynamically adapts to the data distribution and reduces the error caused by binarization. Second, to better preserve the information of the pretrained model and restore accuracy, we propose a Cross-layer Binarization scheme that decouples the binarization of self-attention and multi-layer perceptrons (MLPs), and Parameterized Weight Scales, which introduce learnable scaling factors for weight binarization. Overall, our method performs favorably against the state of the art by 19.8% on the TinyImageNet dataset. On ImageNet, our BiViT achieves a competitive 75.6% Top-1 accuracy with the Swin-S model. Additionally, on COCO object detection, our method achieves an mAP of 40.8 with a Swin-T backbone under the Cascade Mask R-CNN framework.
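To make the idea of weight binarization with learnable scaling factors concrete, below is a minimal PyTorch-style sketch. It is not the authors' implementation: the layer name `BinaryLinear`, the per-output-channel scale `alpha`, its initialization from the mean absolute weight, and the straight-through estimator for the sign function are all illustrative assumptions.

```python
# Minimal sketch of a linear layer whose weights are binarized to {-1, +1}
# and rescaled by a learnable per-output-channel factor alpha.
# Illustrative only; details (alpha init, STE) are assumptions, not BiViT's exact method.
import torch
import torch.nn as nn


class BinaryLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # Learnable scale, here initialized to the mean absolute weight per row.
        self.alpha = nn.Parameter(self.weight.abs().mean(dim=1, keepdim=True))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Straight-through estimator: forward uses sign(w),
        # backward gradients flow as if the binarization were the identity.
        w_bin = w + (torch.sign(w) - w).detach()
        return nn.functional.linear(x, self.alpha * w_bin)


if __name__ == "__main__":
    layer = BinaryLinear(64, 32)
    out = layer(torch.randn(8, 64))
    print(out.shape)  # torch.Size([8, 32])
```

Because `alpha` is a trainable parameter rather than a fixed statistic, the scaling can be optimized end to end, which is the general motivation behind parameterized weight scales.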