Paper Title
Neural Network Training on In-memory-computing Hardware with Radix-4 Gradients
Paper Authors
Paper Abstract
Deep learning training involves a large number of operations, dominated by high-dimensional matrix-vector multiplies (MVMs). This has motivated hardware accelerators that enhance compute efficiency, but data movement and memory access are proving to be the key bottlenecks in such accelerators. In-Memory Computing (IMC) is an approach with the potential to overcome this, whereby computations are performed in place within dense 2-D memory arrays. However, IMC fundamentally trades efficiency and throughput gains for dynamic-range limitations, raising distinct challenges for training, where compute precision requirements are substantially higher than for inference. This paper explores training on IMC hardware by leveraging two recent developments: (1) a training algorithm enabling aggressive quantization through a radix-4 number representation; (2) IMC based on precision capacitors, whereby analog noise effects can be kept well below quantization effects. Energy modeling calibrated to a measured silicon prototype implemented in 16nm CMOS shows that energy savings of over 400x can be achieved with full quantizer adaptability, where all training MVMs can be mapped to IMC, and savings of 3x can be achieved with two-level quantizer adaptability, where two of the three training MVMs can be mapped to IMC.
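The radix-4 gradient idea referenced in the abstract can be illustrated with a minimal sketch. The quantize_radix4 function below is hypothetical (the name, the per-tensor scaling, the number of exponent levels, and the nearest-exponent rounding rule are all assumptions for illustration, not the paper's exact quantizer); it simply projects gradient magnitudes onto a signed grid of powers of 4 plus zero, which is the general shape of a radix-4 number representation.

```python
import numpy as np

def quantize_radix4(grad, num_exp_levels=4):
    """Illustrative sketch: map gradient magnitudes onto a signed grid of
    powers of 4 (plus zero), using a per-tensor scale.

    The level count, scaling, and rounding rule are assumptions for
    illustration, not the paper's exact training algorithm.
    """
    grad = np.asarray(grad, dtype=float)
    scale = np.max(np.abs(grad))
    if scale == 0.0:
        return np.zeros_like(grad)
    sign = np.sign(grad)
    mag = np.abs(grad) / scale                                    # normalize magnitudes to (0, 1]
    exp = np.round(np.log(np.maximum(mag, 1e-12)) / np.log(4.0))  # nearest base-4 exponent
    exp = np.clip(exp, -(num_exp_levels - 1), 0)                  # restrict to a few exponent levels
    q = sign * (4.0 ** exp)
    q[mag < 4.0 ** (-(num_exp_levels - 1) - 0.5)] = 0.0           # flush underflows to zero
    return q * scale
```

With num_exp_levels=4 in this sketch, each gradient is represented by a sign and one of only four power-of-4 magnitude levels (or zero), illustrating how a radix-4 exponent covers a wide dynamic range with very few levels, the kind of aggressive quantization the abstract attributes to the training algorithm.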