Paper Title
Tree-structured Auxiliary Online Knowledge Distillation
Paper Authors
Paper Abstract
Traditional knowledge distillation adopts a two-stage training process in which a teacher model is pre-trained and then transfers its knowledge to a compact student model. To overcome this limitation, online knowledge distillation performs one-stage distillation when a pre-trained teacher is unavailable. Recent research on online knowledge distillation mainly focuses on the design of the distillation objective, including attention or gate mechanisms. In this work, we instead focus on the design of the global architecture and propose Tree-Structured Auxiliary online knowledge distillation (TSA), which hierarchically adds more parallel peers to layers close to the output to strengthen the effect of knowledge distillation. Different branches construct different views of the inputs, which can serve as sources of knowledge. The hierarchical structure implies that the transferred knowledge shifts from general to task-specific as the layers grow deeper. Extensive experiments on 3 computer vision and 4 natural language processing datasets show that our method achieves state-of-the-art performance without bells and whistles. To the best of our knowledge, we are the first to demonstrate the effectiveness of online knowledge distillation for machine translation tasks.
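
To make the tree-structured idea concrete, below is a minimal, hypothetical PyTorch sketch, not the authors' implementation: it shares the early stage of a network, duplicates the later stages into parallel peer branches (so the model forms a tree whose leaves are classifier heads), and trains every leaf with cross-entropy plus a KL term toward the averaged peer ensemble. The names TSANet and tsa_loss, the branching factors, and the use of the peer mean as the soft target are all illustrative assumptions; the paper may split stages and weight peers differently.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TSANet(nn.Module):
    # Shares stage 1; each stage-2 peer spawns its own stage-3 peers,
    # so branches multiply toward the output (a tree, not a flat fan-out).
    def __init__(self, stage1, make_stage2, make_stage3, make_head,
                 branch2=2, branch3=2):
        super().__init__()
        self.stage1 = stage1  # trunk shared by all branches
        self.stage2 = nn.ModuleList([make_stage2() for _ in range(branch2)])
        self.stage3 = nn.ModuleList([
            nn.ModuleList([make_stage3() for _ in range(branch3)])
            for _ in range(branch2)])
        self.heads = nn.ModuleList([
            nn.ModuleList([make_head() for _ in range(branch3)])
            for _ in range(branch2)])

    def forward(self, x):
        h = self.stage1(x)
        logits = []
        for i, s2 in enumerate(self.stage2):
            h2 = s2(h)
            for j, s3 in enumerate(self.stage3[i]):
                logits.append(self.heads[i][j](s3(h2)))
        return logits  # one logits tensor per leaf

def tsa_loss(logits, target, T=3.0, alpha=1.0):
    # Cross-entropy per leaf plus KL toward the averaged peer ensemble.
    # Stop-gradient on the ensemble target is a common online-KD choice
    # (an assumption here, not necessarily the paper's objective).
    ce = sum(F.cross_entropy(z, target) for z in logits)
    with torch.no_grad():
        ensemble = torch.stack(
            [F.softmax(z / T, dim=1) for z in logits]).mean(0)
    kd = sum(F.kl_div(F.log_softmax(z / T, dim=1), ensemble,
                      reduction="batchmean") * T * T for z in logits)
    return ce + alpha * kd

# Purely illustrative usage with tiny MLP stages on fake data.
net = TSANet(
    stage1=nn.Sequential(nn.Linear(32, 64), nn.ReLU()),
    make_stage2=lambda: nn.Sequential(nn.Linear(64, 64), nn.ReLU()),
    make_stage3=lambda: nn.Sequential(nn.Linear(64, 64), nn.ReLU()),
    make_head=lambda: nn.Linear(64, 10))
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = tsa_loss(net(x), y)
loss.backward()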