Paper Title
HAWQV3: Dyadic Neural Network Quantization
Paper Authors
Paper Abstract
Current low-precision quantization algorithms often incur the hidden cost of converting back and forth between floating-point and quantized integer values. This hidden cost limits the latency improvement realized by quantizing neural networks. To address this, we present HAWQV3, a novel mixed-precision integer-only quantization framework. The contributions of HAWQV3 are the following: (i) integer-only inference, where the entire computational graph is performed only with integer multiplication, addition, and bit shifting, without any floating-point operations or even integer division; (ii) a novel hardware-aware mixed-precision quantization method, where the bit precision is calculated by solving an integer linear programming problem that balances the trade-off between model perturbation and other constraints, e.g., memory footprint and latency; (iii) direct hardware deployment and open-source contribution of 4-bit uniform/mixed-precision quantization in TVM, achieving an average speedup of $1.45\times$ for uniform 4-bit, as compared to uniform 8-bit, for ResNet50 on T4 GPUs; and (iv) extensive evaluation of the proposed methods on ResNet18/50 and InceptionV3, for various model compression levels with/without mixed precision. For ResNet50, our INT8 quantization achieves an accuracy of $77.58\%$, which is $2.68\%$ higher than prior integer-only work, and our mixed-precision INT4/8 quantization can reduce INT8 latency by $23\%$ while still achieving $76.73\%$ accuracy. Our framework and the TVM implementation have been open sourced.
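To make contribution (i) concrete, here is a minimal Python sketch of dyadic requantization under the scheme the abstract describes: a real-valued rescaling factor is approximated once, offline, by a dyadic number $b/2^c$, so that inference needs only an integer multiply and a bit shift. The helper names `dyadic_approx` and `requantize` and the example scale value are illustrative, not taken from the paper's codebase.

```python
import numpy as np

def dyadic_approx(scale, num_bits=31):
    """Approximate a positive real `scale` by a dyadic number b / 2**c,
    so multiplying by `scale` becomes an integer multiply and a right shift."""
    m, e = np.frexp(scale)               # scale = m * 2**e, with 0.5 <= m < 1
    b = int(round(m * (1 << num_bits)))  # integer numerator
    c = num_bits - int(e)                # power-of-two denominator
    return b, c

def requantize(acc, b, c):
    """Integer-only requantization of int32 accumulators: (acc * b) >> c,
    rounded to nearest and clipped to the int8 range."""
    prod = acc.astype(np.int64) * b       # widen to avoid overflow
    out = (prod + (1 << (c - 1))) >> c    # add half-ulp, then shift
    return np.clip(out, -128, 127).astype(np.int8)

# Example: fold a combined scale S_w * S_x / S_out (value made up here)
# into (b, c) once; everything at inference time is then integer-only.
b, c = dyadic_approx(0.0072)
acc = np.array([1500, -4200, 90], dtype=np.int32)  # int32 conv/GEMM outputs
print(requantize(acc, b, c))                       # int8 activations
```

Similarly, a toy version of the integer linear program in contribution (ii) can be written with the PuLP library. The per-layer perturbations and parameter counts below are made-up numbers for illustration, not values from the paper; only the structure follows the abstract: minimize total model perturbation subject to a memory budget, with exactly one bit-width chosen per layer.

```python
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpBinary

layers = [0, 1, 2, 3]
bits = [4, 8]
# Made-up inputs: pert[(l, b)] is the model perturbation if layer l uses
# b bits; params[l] is the layer's parameter count in millions.
pert = {(0, 4): 0.90, (0, 8): 0.20, (1, 4): 0.50, (1, 8): 0.10,
        (2, 4): 0.70, (2, 8): 0.15, (3, 4): 0.30, (3, 8): 0.05}
params = {0: 2.0, 1: 4.0, 2: 4.0, 3: 8.0}

x = {(l, b): LpVariable(f"x_{l}_{b}", cat=LpBinary) for l in layers for b in bits}
prob = LpProblem("bit_allocation", LpMinimize)
prob += lpSum(pert[l, b] * x[l, b] for l in layers for b in bits)  # total perturbation
for l in layers:
    prob += lpSum(x[l, b] for b in bits) == 1                      # one bit-width per layer
size_mb = lpSum(params[l] * b / 8 * x[l, b] for l in layers for b in bits)
prob += size_mb <= 12.0                                            # memory budget in MB
prob.solve()
print({l: b for l in layers for b in bits if x[l, b].value() > 0.5})
```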