Paper Title
Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN Inference
Paper Authors
Paper Abstract
Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). The systolic array (SA), a pipelined 2D array of processing elements (PEs) with very efficient local data movement, is well suited to accelerating GEMM and is widely deployed in industry. In this work, we describe two significant improvements to the traditional SA architecture that specifically optimize it for CNN inference. First, we generalize the traditional scalar PE into a Tensor-PE, which gives rise to a family of new Systolic Tensor Array (STA) microarchitectures. The STA family increases intra-PE operand reuse and datapath efficiency, reducing circuit area and power dissipation by as much as 2.08x and 1.36x respectively, compared to the conventional SA at iso-throughput with INT8 operands. Second, we extend this design to support a novel block-sparse data format called density-bound block (DBB). This variant (STA-DBB) achieves a 3.14x and 1.97x improvement over the SA baseline at iso-throughput in area and power respectively, when processing specially-trained DBB-sparse models, while remaining fully backwards compatible with dense models.
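To make the density-bound block (DBB) idea concrete, below is a minimal NumPy sketch of pruning a weight matrix to a per-block nonzero bound. The block size of 8 and bound of 4 are illustrative assumptions, not parameters taken from the abstract, and dbb_prune is a hypothetical helper, not the authors' training method (the paper relies on specially-trained DBB-sparse models). Note that setting max_nonzeros equal to block_size leaves a dense matrix unchanged, mirroring the claimed backward compatibility with dense models.

    import numpy as np

    def dbb_prune(weights, block_size=8, max_nonzeros=4):
        # Split the matrix into contiguous 1-D blocks of `block_size`
        # elements and zero the smallest-magnitude entries so that each
        # block keeps at most `max_nonzeros` nonzero values.
        out = weights.astype(np.float32).reshape(-1, block_size).copy()
        for block in out:
            drop = np.argsort(np.abs(block))[: block_size - max_nonzeros]
            block[drop] = 0.0
        return out.reshape(weights.shape)

    rng = np.random.default_rng(0)
    w = rng.integers(-128, 128, size=(8, 16))  # toy INT8-range weight matrix
    w_dbb = dbb_prune(w, block_size=8, max_nonzeros=4)
    # Every 8-wide block now has at most 4 nonzeros (a 50% density bound).
    assert all((blk != 0).sum() <= 4 for blk in w_dbb.reshape(-1, 8))

Because the nonzero count per block is bounded rather than fixed, a DBB accelerator can provision exactly max_nonzeros multipliers per block and still accept any dense matrix by using the degenerate bound.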