Paper Title


VWA: Hardware Efficient Vectorwise Accelerator for Convolutional Neural Network

Authors

Kuo-Wei Chang and Tian-Sheuan Chang

Abstract


Hardware accelerators for convolutional neural networks (CNNs) enable real-time applications of artificial intelligence technology. However, most of the existing designs suffer from low hardware utilization or high area cost due to complex dataflow. This paper proposes a hardware efficient vectorwise CNN accelerator that adopts a 3$\times$3 filter optimized systolic array using a 1-D broadcast dataflow to generate partial sums. This enables easy reconfiguration for different kinds of kernels with interleaved input or elementwise input dataflows. This simple and regular dataflow results in low area cost while attaining high hardware utilization. The presented design achieves 99\%, 97\%, 93.7\%, and 94\% hardware utilization for VGG-16, ResNet-34, GoogLeNet, and MobileNet, respectively. Hardware implementation with TSMC 40nm technology takes 266.9K NAND gate count and 191KB SRAM to support 168GOPS throughput, and consumes only 154.98mW when running at a 500MHz operating frequency, giving superior area and power efficiency compared to other designs.
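To make the 1-D broadcast idea in the abstract concrete, the following is a minimal functional sketch (not the paper's actual RTL or dataflow schedule, and all names are hypothetical): each input-feature-map row is broadcast once, and every output row that depends on it accumulates a row-wise partial sum using the corresponding 1$\times$3 filter row, so a 3$\times$3 convolution is built from three vector partial sums per output row.

```python
import numpy as np

def conv3x3_vectorwise(ifmap, weights):
    """Illustrative sketch of a 1-D broadcast partial-sum dataflow for a
    3x3 convolution (stride 1, no padding). Each input row `r` is visited
    once; the three output rows that use it each accumulate a partial sum
    computed with one 1x3 filter row."""
    H, W = ifmap.shape
    out = np.zeros((H - 2, W - 2))
    for r in range(H):                 # broadcast input row r once
        for kr in range(3):            # which filter row consumes it
            orow = r - kr              # output row receiving this partial sum
            if 0 <= orow < H - 2:
                for c in range(W - 2):
                    # 1x3 vector dot product: one row-wise partial sum
                    out[orow, c] += np.dot(weights[kr], ifmap[r, c:c + 3])
    return out
```

The point of the schedule is that the inner work is a regular 1-D vector operation, which is what makes the hardware simple to lay out and easy to reconfigure for other kernel shapes.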
