Paper Title
Beyond Bounding Box: Multimodal Knowledge Learning for Object Detection
Paper Authors
Paper Abstract
Multimodal supervision has achieved promising results in many vision-language understanding tasks, where language plays an essential role as a hint or context for recognizing and locating instances. However, due to the defects of human-annotated language corpora, multimodal supervision remains unexplored in fully supervised object detection scenarios. In this paper, we take advantage of language prompts to introduce effective and unbiased linguistic supervision into object detection, and propose a new mechanism called multimodal knowledge learning (\textbf{MKL}), which acquires knowledge from language supervision. Specifically, we design prompts and fill them with bounding box annotations to generate descriptions containing extensive hints and context for instance recognition and localization. The knowledge from language is then distilled into the detection model by maximizing cross-modal mutual information at both the image and object levels. Moreover, the generated descriptions are manipulated to produce hard negatives, which further boost detector performance. Extensive experiments demonstrate that the proposed method yields a consistent performance gain of 1.6\% $\sim$ 2.1\% and achieves state-of-the-art results on the MS-COCO and OpenImages datasets.
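The abstract describes two main ingredients: turning bounding box annotations into prompt-based descriptions, and distilling the resulting language knowledge into the detector by maximizing cross-modal mutual information. The sketch below illustrates both at a high level. The prompt wording, the feature dimensions, and the use of an InfoNCE-style contrastive objective as the mutual-information estimator are assumptions made for illustration; they are not the authors' exact templates or loss.

```python
# Minimal sketch, assuming: (1) a hypothetical prompt template that verbalizes
# each (class, box) annotation, and (2) InfoNCE as a lower bound on the
# cross-modal mutual information between image and text features.
import torch
import torch.nn.functional as F


def box_to_phrase(class_name: str, box, image_w: int, image_h: int) -> str:
    """Turn one (class, box) annotation into a coarse textual hint.
    `box` is (x1, y1, x2, y2); the spatial wording is a hypothetical template."""
    cx = (box[0] + box[2]) / 2 / image_w
    cy = (box[1] + box[3]) / 2 / image_h
    horiz = "left" if cx < 1 / 3 else "right" if cx > 2 / 3 else "center"
    vert = "top" if cy < 1 / 3 else "bottom" if cy > 2 / 3 else "middle"
    return f"a {class_name} at the {vert} {horiz} of the image"


def image_description(annotations, image_w: int, image_h: int) -> str:
    """Concatenate per-object phrases into one image-level description."""
    phrases = [box_to_phrase(c, b, image_w, image_h) for c, b in annotations]
    return "a photo containing " + ", ".join(phrases)


def info_nce(visual: torch.Tensor, textual: torch.Tensor, tau: float = 0.07):
    """InfoNCE loss between matched visual/textual features of shape (N, D).
    Minimizing it maximizes a lower bound on their mutual information."""
    v = F.normalize(visual, dim=-1)
    t = F.normalize(textual, dim=-1)
    logits = v @ t.t() / tau                      # (N, N) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric contrastive objective: image-to-text and text-to-image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    anns = [("dog", (30, 200, 180, 380)), ("person", (250, 40, 400, 360))]
    print(image_description(anns, image_w=640, image_h=480))
    # Stand-in features for a batch of 8 image/description pairs.
    loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
    print(float(loss))
```

In this reading, the same contrastive objective could be applied once over pooled image features (image level) and once over per-box region features paired with their object phrases (object level); hard negatives would then be descriptions whose class names or spatial words have been swapped.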