Paper Title
KD-DETR: Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling
Paper Authors
Paper Abstract
DETR is a novel end-to-end transformer-based object detector, which significantly outperforms classic detectors when scaling up. In this paper, we focus on the compression of DETR with knowledge distillation. While knowledge distillation has been well studied for classic detectors, there is a lack of research on how to make it work effectively on DETR. We first provide experimental and theoretical analysis to point out that the main challenge in DETR distillation is the lack of consistent distillation points. Distillation points refer to the corresponding inputs of the predictions for the student to mimic, which have different formulations in CNN detectors and DETR, and reliable distillation requires sufficient distillation points that are consistent between teacher and student. Based on this observation, we propose the first general knowledge distillation paradigm for DETR (KD-DETR) with consistent distillation points sampling, for both homogeneous and heterogeneous distillation. Specifically, we decouple the detection and distillation tasks by introducing a set of specialized object queries to construct distillation points for DETR. We further propose a general-to-specific distillation points sampling strategy to explore the extensibility of KD-DETR. Extensive experiments validate the effectiveness and generalization of KD-DETR. For both single-scale DAB-DETR and multi-scale Deformable DETR and DINO, KD-DETR boosts the performance of the student model with improvements of $2.6\%-5.2\%$. We further extend KD-DETR to heterogeneous distillation and achieve a $2.1\%$ improvement by distilling the knowledge from DINO to Faster R-CNN with ResNet-50, which is comparable with homogeneous distillation methods. The code is available at https://github.com/wennyuhey/KD-DETR.
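To make the core idea of consistent distillation points concrete, below is a minimal sketch (not the authors' implementation; see the repository linked above for that): the same set of sampled query embeddings is fed to both the teacher and the student decoder, so the two models produce predictions at identical distillation points that the student can directly mimic. The `TinyDecoder` stand-in, the tensor sizes, and the plain KL/L1 mimicking losses are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of consistent distillation point sampling for a DETR-style model.
# Assumption: both teacher and student expose a decoder that maps query embeddings
# plus image features to (class logits, boxes); real DETR decoders are far richer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDecoder(nn.Module):
    """Stand-in for a DETR-style decoder: query embeddings -> class logits and boxes."""
    def __init__(self, dim=256, num_classes=80):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cls_head = nn.Linear(dim, num_classes)
        self.box_head = nn.Linear(dim, 4)

    def forward(self, queries, memory):
        # queries: (B, Q, dim) distillation-point queries; memory: (B, HW, dim) image features
        h, _ = self.attn(queries, memory, memory)
        return self.cls_head(h), self.box_head(h).sigmoid()

def distillation_loss(student, teacher, queries, memory_s, memory_t, t=2.0):
    """Student mimics teacher predictions at the SAME (consistent) distillation points."""
    with torch.no_grad():
        t_logits, t_boxes = teacher(queries, memory_t)
    s_logits, s_boxes = student(queries, memory_s)
    cls_kd = F.kl_div(F.log_softmax(s_logits / t, dim=-1),
                      F.softmax(t_logits / t, dim=-1),
                      reduction="batchmean") * t * t
    box_kd = F.l1_loss(s_boxes, t_boxes)
    return cls_kd + box_kd  # illustrative 1:1 weighting

# Usage: one shared set of sampled queries serves as the distillation points.
B, Q, HW, D = 2, 30, 100, 256
teacher, student = TinyDecoder(D), TinyDecoder(D)
queries = torch.randn(B, Q, D)                        # e.g. general-to-specific sampled queries
mem_s, mem_t = torch.randn(B, HW, D), torch.randn(B, HW, D)
loss = distillation_loss(student, teacher, queries, mem_s, mem_t)
```

Because the distillation queries are decoupled from the detection queries, this mimicking loss can be added alongside the student's ordinary detection loss without interfering with its own object queries.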