Paper Title
CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet
Paper Authors
Paper Abstract
Recent studies have shown that CLIP achieves remarkable success in zero-shot inference, while its fine-tuning performance is not satisfactory. In this paper, we identify that fine-tuning performance is significantly impacted by hyper-parameter choices. Through a comprehensive study, we examine various key hyper-parameters and empirically evaluate their impact when fine-tuning CLIP for classification tasks. We find that the fine-tuning performance of CLIP has been substantially underestimated. Equipped with hyper-parameter refinement, we demonstrate that CLIP itself is better than, or at least competitive with, large-scale supervised pre-training approaches and recent works that use CLIP as the prediction target in Masked Image Modeling. Specifically, CLIP ViT-Base/16 and CLIP ViT-Large/14 achieve 85.7% and 88.0% fine-tuning Top-1 accuracy on the ImageNet-1K dataset. These observations challenge the conventional conclusion that CLIP is not suitable for fine-tuning, and motivate us to rethink recently proposed improvements based on CLIP. We will release our code publicly at \url{https://github.com/LightDXY/FT-CLIP}.
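The abstract describes fine-tuning the CLIP image encoder directly for ImageNet-1K classification. Below is a minimal sketch of that setup, not the authors' released code: it assumes the OpenAI `clip` package (https://github.com/openai/CLIP) is installed, and the hyper-parameter values shown are illustrative placeholders rather than the refined settings reported in the paper.

```python
# Minimal sketch: attach a linear classification head to the CLIP image tower
# and fine-tune end-to-end. Hyper-parameters here are illustrative only; the
# paper's point is that choices such as learning rate and related settings
# strongly affect the final accuracy.
import torch
import torch.nn as nn
import clip  # pip install git+https://github.com/openai/CLIP.git


class CLIPClassifier(nn.Module):
    """CLIP vision encoder followed by a randomly initialized linear head."""

    def __init__(self, arch: str = "ViT-B/16", num_classes: int = 1000):
        super().__init__()
        clip_model, _ = clip.load(arch, device="cpu")
        self.visual = clip_model.visual.float()   # keep only the image tower
        embed_dim = self.visual.output_dim        # 512 for ViT-B/16
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.visual(images)               # (B, embed_dim) image features
        return self.head(feats)                   # (B, num_classes) logits


model = CLIPClassifier()
# Illustrative optimizer and loss; not the paper's tuned recipe.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Stand-in batch to show one training step.
images = torch.randn(2, 3, 224, 224)
labels = torch.randint(0, 1000, (2,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```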