Paper Title

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Paper Authors

Tiannan Wang, Wangchunshu Zhou, Yan Zeng, Xinsong Zhang

Abstract

Pre-trained vision-language models (VLMs) have achieved impressive results on a range of vision-language tasks. However, popular VLMs usually consist of hundreds of millions of parameters, which brings challenges for fine-tuning and deployment in real-world applications due to space, memory, and latency constraints. In this work, we introduce a distilling-then-pruning framework to compress large vision-language models into smaller, faster, and more accurate ones. We first shrink the size of a pre-trained large VLM and apply knowledge distillation in the vision-language pre-training stage to obtain a task-agnostic compact VLM. Then we propose a modal-adaptive pruning algorithm to automatically infer the importance of the vision and language modalities for different downstream tasks and adaptively remove redundant structures and neurons in the different encoders with controllable target sparsity. We apply our framework to train EfficientVLM, a fast and accurate vision-language model consisting of 6 vision layers, 3 text layers, and 3 cross-modal fusion layers, totaling only 93 million parameters, or 44.3% of the teacher model. EfficientVLM retains 98.4% of the teacher model's performance while accelerating its inference speed by 2.2x. EfficientVLM achieves large absolute improvements over previous SoTA efficient VLMs of similar size on various vision-language tasks, including VQAv2 (+4.9%), NLVR2 (+5.6%), ITR (R@1: +17.2% on TR, +15.6% on IR), and COCO caption generation (CIDEr +6.5), demonstrating great potential for training lightweight VLMs.
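
The abstract does not spell out the distillation objective used in the pre-training stage. As a rough illustration of task-agnostic knowledge distillation, here is a minimal PyTorch sketch that combines a temperature-scaled KL term on logits with an MSE term aligning hidden states; the function name kd_loss, the temperature, and the mixing weight alpha are illustrative assumptions, not the paper's exact formulation (which distills a multi-encoder VLM, not a single stack).

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits,
            student_hidden, teacher_hidden,
            temperature=2.0, alpha=0.5):
    """Hypothetical KD objective: soft-label KL on logits + hidden-state MSE."""
    # Soft-label distillation: KL between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hidden-state alignment (assumes matching dimensions or a projection).
    hidden = F.mse_loss(student_hidden, teacher_hidden)
    return alpha * soft + (1.0 - alpha) * hidden

# Toy usage: batch of 4, vocabulary of 100, hidden size 32.
s_logits, t_logits = torch.randn(4, 100), torch.randn(4, 100)
s_hid, t_hid = torch.randn(4, 32), torch.randn(4, 32)
loss = kd_loss(s_logits, t_logits, s_hid, t_hid)
```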
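Similarly, the modal-adaptive pruning algorithm is only summarized above. The sketch below assumes one common realization: a learnable relaxed keep/drop mask per prunable unit (e.g., attention head or FFN neuron) in each encoder, with a penalty that drives the overall expected kept ratio toward a controllable target sparsity while leaving the split across modalities to be learned per task. The class and method names (ModalAdaptivePruner, sparsity_penalty) are hypothetical, not the paper's API.

```python
import torch

class ModalAdaptivePruner(torch.nn.Module):
    """Hypothetical sketch: one learnable importance score per prunable unit
    in each modality's encoder. The penalty constrains the *overall* kept
    ratio to 1 - target_sparsity; how much each modality is pruned is free."""

    def __init__(self, units_per_modality):
        super().__init__()
        self.scores = torch.nn.ParameterDict({
            name: torch.nn.Parameter(torch.zeros(n))
            for name, n in units_per_modality.items()
        })

    def masks(self):
        # Continuous (sigmoid) relaxation of binary keep/drop masks.
        return {name: torch.sigmoid(s) for name, s in self.scores.items()}

    def sparsity_penalty(self, target_sparsity):
        kept = torch.cat(list(self.masks().values())).mean()
        # Penalize deviation of the expected kept ratio from the target.
        return (kept - (1.0 - target_sparsity)) ** 2

# Toy usage: prunable units in the vision, text, and fusion encoders.
pruner = ModalAdaptivePruner({"vision": 1024, "text": 512, "fusion": 512})
penalty = pruner.sparsity_penalty(target_sparsity=0.5)
```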
