Paper Title
CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet
Paper Authors
Paper Abstract
Recent studies have shown that CLIP achieves remarkable success in zero-shot inference, while its fine-tuning performance is not satisfactory. In this paper, we identify that fine-tuning performance is significantly impacted by hyper-parameter choices. Through a comprehensive study, we examine various key hyper-parameters and empirically evaluate their impact when fine-tuning CLIP for classification tasks. We find that the fine-tuning performance of CLIP has been substantially underestimated. Equipped with hyper-parameter refinement, we demonstrate that CLIP itself is better than, or at least competitive with, large-scale supervised pre-training approaches and recent works that use CLIP as the prediction target in Masked Image Modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning Top-1 accuracy on the ImageNet-1K dataset. These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning, and motivate us to rethink recently proposed improvements based on CLIP. We will release our code publicly at \url{https://github.com/LightDXY/FT-CLIP}.
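The abstract describes fine-tuning the CLIP image encoder directly for ImageNet-1K classification. Below is a minimal sketch of that setup, not the authors' released code: it assumes the OpenAI `clip` package (https://github.com/openai/CLIP) is installed, and the hyper-parameter values shown are illustrative placeholders rather than the refined settings reported in the paper.

```python
# Minimal sketch: attach a linear classification head to the CLIP image tower
# and fine-tune end-to-end. Hyper-parameters here are illustrative only; the
# paper's point is that choices such as learning rate and related settings
# strongly affect the final accuracy.
import torch
import torch.nn as nn
import clip  # pip install git+https://github.com/openai/CLIP.git


class CLIPClassifier(nn.Module):
    """CLIP vision encoder followed by a randomly initialized linear head."""

    def __init__(self, arch: str = "ViT-B/16", num_classes: int = 1000):
        super().__init__()
        clip_model, _ = clip.load(arch, device="cpu")
        self.visual = clip_model.visual.float()   # keep only the image tower
        embed_dim = self.visual.output_dim        # 512 for ViT-B/16
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.visual(images)               # (B, embed_dim) image features
        return self.head(feats)                   # (B, num_classes) logits


model = CLIPClassifier()
# Illustrative optimizer and loss; not the paper's tuned recipe.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Stand-in batch to show one training step.
images = torch.randn(2, 3, 224, 224)
labels = torch.randint(0, 1000, (2,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```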