Paper Title


Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models

Authors

Ma, Chengcheng; Liu, Yang; Deng, Jiankang; Xie, Lingxi; Dong, Weiming; Xu, Changsheng

Abstract


Pretrained vision-language models (VLMs) such as CLIP have shown impressive generalization capability on downstream vision tasks with appropriate text prompts. Instead of designing prompts manually, Context Optimization (CoOp) has recently been proposed to learn continuous prompts using task-specific training data. Despite the performance improvements on downstream tasks, several studies have reported that CoOp suffers from overfitting in two aspects: (i) the test accuracy on base classes first improves and then worsens during training; (ii) the test accuracy on novel classes keeps decreasing. However, none of the existing studies can understand and mitigate such overfitting problems. In this study, we first explore the cause of overfitting by analyzing the gradient flow. Comparative experiments reveal that CoOp favors generalizable and spurious features in the early and later training stages, respectively, leading to the non-overfitting and overfitting phenomena. Given those observations, we propose Subspace Prompt Tuning (SubPT) to project the back-propagated gradients onto the low-rank subspace spanned by the early-stage gradient-flow eigenvectors throughout the entire training process, successfully eliminating the overfitting problem. In addition, we equip CoOp with a Novel Feature Learner (NFL) to enhance the generalization ability of the learned prompts to novel categories beyond the training set, without requiring image training data. Extensive experiments on 11 classification datasets demonstrate that SubPT+NFL consistently boosts the performance of CoOp and outperforms the state-of-the-art CoCoOp approach. Experiments on more challenging downstream vision tasks, including open-vocabulary object detection and zero-shot semantic segmentation, further verify the effectiveness of the proposed method. Code can be found at https://tinyurl.com/mpe64f89.
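The abstract's core mechanism in SubPT is projecting each back-propagated gradient onto a low-rank subspace spanned by early-stage gradient directions. Below is a minimal NumPy sketch of that projection idea only; the function names, the SVD-based basis construction, and the rank parameter `k` are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import numpy as np

def early_stage_subspace(early_grads, k):
    """Build an orthonormal basis for the top-k subspace of early-stage gradients.

    early_grads: list of flattened gradient vectors collected early in training.
    Returns a (d, k) matrix whose columns span the low-rank subspace.
    """
    G = np.stack(early_grads)                      # shape (T, d)
    # Right singular vectors of G are eigenvectors of G^T G (the
    # uncentered gradient second-moment matrix); keep the top-k.
    _, _, vt = np.linalg.svd(G, full_matrices=False)
    return vt[:k].T                                # shape (d, k)

def project_gradient(g, basis):
    """Project a later-stage gradient onto the early-stage subspace."""
    return basis @ (basis.T @ g)
```

During later training steps, one would replace the raw gradient `g` with `project_gradient(g, basis)` before the optimizer update, so updates stay within the directions favored early in training.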
