Paper Title

Multi-head Knowledge Distillation for Model Compression

Authors

Huan Wang, Suhas Lohit, Michael Jones, Yun Fu

Abstract

Several methods of knowledge distillation have been developed for neural network compression. While they all use the KL divergence loss to align the soft outputs of the student model more closely with those of the teacher, the various methods differ in how the intermediate features of the student are encouraged to match those of the teacher. In this paper, we propose a simple-to-implement method using auxiliary classifiers at intermediate layers for matching features, which we refer to as multi-head knowledge distillation (MHKD). We add loss terms for training the student that measure the dissimilarity between student and teacher outputs of the auxiliary classifiers. At the same time, the proposed method also provides a natural way to measure differences at the intermediate layers even though the dimensions of the internal teacher and student features may be different. Through several experiments in image classification on multiple datasets, we show that the proposed method outperforms prior relevant approaches presented in the literature.
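The abstract describes a training objective with two kinds of terms: the usual KL divergence between temperature-softened final outputs of student and teacher, plus additional KL terms comparing the outputs of auxiliary classifiers attached to intermediate layers of both networks. The following is a minimal PyTorch-style sketch of such an objective, written only to illustrate the idea; it is not the authors' reference implementation, and the names `mhkd_style_loss`, `aux_heads_s`, `aux_heads_t`, `alpha`, `beta`, and `T` are assumptions introduced here for illustration.

```python
# Minimal sketch (assumed names, not the paper's official code) of a
# multi-head distillation-style loss: final-output KD plus KL terms on
# auxiliary-classifier outputs computed from intermediate features.
import torch
import torch.nn.functional as F


def soft_kl(student_logits, teacher_logits, T=4.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    # Scale by T^2 so gradient magnitude is roughly independent of temperature.
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)


def mhkd_style_loss(logits_s, logits_t, feats_s, feats_t,
                    aux_heads_s, aux_heads_t, labels,
                    alpha=0.9, beta=1.0, T=4.0):
    # Hard-label cross-entropy on the student's final output.
    loss = (1.0 - alpha) * F.cross_entropy(logits_s, labels)
    # Standard KD term: align the final soft outputs of student and teacher.
    loss = loss + alpha * soft_kl(logits_s, logits_t.detach(), T)
    # Multi-head terms: each pair of auxiliary classifiers maps intermediate
    # features (possibly of different shapes) to class scores, so the
    # student/teacher comparison happens in a common output space.
    for head_s, head_t, f_s, f_t in zip(aux_heads_s, aux_heads_t,
                                        feats_s, feats_t):
        loss = loss + beta * soft_kl(head_s(f_s), head_t(f_t).detach(), T)
    return loss
```

In this sketch the teacher-side quantities are detached so that only the student (and, depending on the training setup, its auxiliary heads) receives gradients; because the comparison is done on classifier outputs rather than raw features, no explicit projection is needed when the student and teacher feature dimensions differ, which is the property the abstract highlights.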
