Paper Title
Towards Effective Depthwise Convolutions on ARMv8 Architecture
Paper Authors
Abstract
Depthwise convolutions are widely used in lightweight convolutional neural networks (CNNs). Unlike classic convolutions, their performance is bounded mainly by memory access rather than arithmetic operations, so direct algorithms are often more efficient than indirect ones (matrix multiplication-, Winograd-, and FFT-based convolutions), which incur additional memory accesses. However, the existing direct implementations of depthwise convolutions on ARMv8 architectures exhibit a poor trade-off between register-level reuse of the different tensors, which usually leads to sub-optimal performance. In this paper, we propose new direct implementations of depthwise convolutions by means of implicit padding, register tiling, and related techniques, covering the forward propagation, backward propagation, and weight gradient update procedures. Compared with the existing implementations, our new ones incur much less communication overhead between registers and cache. Experimental results on two ARMv8 CPUs show that our implementations deliver average performance improvements of 4.88x over the existing direct implementations in open-source libraries and 16.4x over the matrix multiplication-based ones in PyTorch.