Title
Exploring Neural Networks Quantization via Layer-Wise Quantization Analysis
Authors
Abstract
Quantization is an essential step in the efficient deployment of deep learning models and as such is an increasingly popular research topic. An important practical aspect that is not addressed in the current literature is how to analyze and fix failure cases where the use of quantization results in excessive degradation. In this paper, we present a simple analytic framework that breaks down overall degradation into its per-layer contributions. We analyze many common networks and observe that a layer's contribution is determined by both intrinsic (local) factors - the distribution of the layer's weights and activations - and extrinsic (global) factors having to do with the interaction with the rest of the layers. Layer-wise analysis of existing quantization schemes reveals local failure cases of existing techniques which are not reflected when inspecting their overall performance. As an example, we consider ResNext26, on which SoTA post-training quantization methods perform poorly. We show that almost all of the degradation stems from a single layer. The same analysis also allows for local fixes - applying a common weight-clipping heuristic only to this layer reduces degradation to a minimum, while applying the same heuristic globally results in high degradation. More generally, layer-wise analysis allows for a more nuanced examination of how quantization affects the network, enabling the design of better-performing schemes.
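The core idea of the framework - quantizing one layer at a time and measuring the resulting output error against the full-precision network - can be illustrated with a minimal sketch. This is not the authors' implementation: the toy ReLU MLP, the uniform symmetric quantizer, and the MSE-based degradation measure are all simplifying assumptions made here for illustration.

```python
import numpy as np

def quantize(x, n_bits=8):
    # Uniform symmetric quantization to n_bits (a common simple scheme,
    # assumed here; the paper's schemes may differ).
    scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1)
    if scale == 0:
        return x
    return np.round(x / scale) * scale

def forward(weights, x):
    # Toy ReLU MLP forward pass; stands in for a real network.
    for W in weights[:-1]:
        x = np.maximum(x @ W, 0.0)
    return x @ weights[-1]

def layerwise_degradation(weights, x, n_bits=8):
    # Quantize one layer at a time and record the output MSE relative
    # to the full-precision network; each entry is that layer's
    # (intrinsic + interaction) contribution under this protocol.
    ref = forward(weights, x)
    contributions = []
    for i in range(len(weights)):
        q = [W.copy() for W in weights]
        q[i] = quantize(q[i], n_bits)
        out = forward(q, x)
        contributions.append(float(np.mean((out - ref) ** 2)))
    return contributions
```

Sorting the returned list identifies the layers responsible for most of the degradation, which is exactly the kind of diagnosis that lets a clipping fix be applied locally rather than globally.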