Paper Title


HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point Operations for Convolutional Neural Networks

Authors

James Garland, David Gregg

Abstract


Convolutional neural networks (CNNs) are typically trained using 16- or 32-bit floating-point (FP), and researchers have shown that low-precision floating-point (FP) can be highly effective for inference. Low-precision FP can be implemented in field programmable gate array (FPGA) and application-specific integrated circuit (ASIC) accelerators, but existing processors do not generally support custom-precision FP. We propose hardware optimized bitslice-parallel floating-point operators (HOBFLOPS), a method of generating efficient custom-precision emulated bitslice-parallel software FP arithmetic. We generate custom-precision FP routines optimized using a hardware synthesis design flow to create circuits. We provide standard cell libraries matching the bitwise operations on the target microprocessor architecture, and a code generator to translate the hardware circuits to bitslice software equivalents. We exploit bitslice parallelism to create a very wide (32-512 element) vectorized convolutional neural network (CNN) convolution. Hardware optimized bitslice-parallel floating-point operators (HOBFLOPS) multiply-accumulate (MAC) performance in CNN convolution on Arm and Intel processors is compared to Berkeley's SoftFP16 equivalent MAC. HOBFLOPS16 outperforms SoftFP16 by 8x on Intel AVX512. HOBFLOPS offers arbitrary-precision FP with custom range and precision, e.g., HOBFLOPS9 performs at 6x the performance of HOBFLOPS16 on Arm Neon. HOBFLOPS allows researchers to prototype different levels of custom FP precision in the arithmetic of software CNN accelerators. Furthermore, HOBFLOPS fast custom-precision FP CNNs may be valuable in cases where memory bandwidth is limited.
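To make the bitslice-parallel idea concrete, the following minimal C sketch (not code from the paper; the function name `bitslice_full_adder` and the lane layout are illustrative assumptions) shows the core trick the abstract describes: each machine word holds one bit position from many independent lanes, so a short sequence of bitwise operations performs the same one-bit computation across all lanes at once. A full HOBFLOPS operator would chain many such gate-level steps, generated automatically from the synthesized circuit rather than written by hand.

```c
/*
 * Minimal illustration of bitslice parallelism (illustrative sketch, not the
 * paper's generated code): each uint32_t word holds one bit position from 32
 * independent "lanes", so one sequence of bitwise ops performs 32 one-bit
 * additions in parallel.
 */
#include <stdint.h>
#include <stdio.h>

/* One-bit full adder applied to 32 lanes simultaneously. */
static void bitslice_full_adder(uint32_t a, uint32_t b, uint32_t cin,
                                uint32_t *sum, uint32_t *cout)
{
    uint32_t p = a ^ b;            /* propagate signal per lane */
    *sum  = p ^ cin;               /* sum bit of every lane */
    *cout = (a & b) | (p & cin);   /* carry-out of every lane */
}

int main(void)
{
    /* Bit i of each word is the operand bit belonging to lane i. */
    uint32_t a   = 0xF0F0F0F0u;
    uint32_t b   = 0xCCCCCCCCu;
    uint32_t cin = 0x00000000u;
    uint32_t sum, cout;

    bitslice_full_adder(a, b, cin, &sum, &cout);
    printf("sum  = 0x%08X\n", sum);   /* 0x3C3C3C3C */
    printf("cout = 0x%08X\n", cout);  /* 0xC0C0C0C0 */
    return 0;
}
```

On a 512-bit SIMD register the same bitwise sequence would process 512 lanes per instruction, which is where the very wide vectorized convolution described in the abstract comes from.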
