Title

No Token Left Behind: Explainability-Aided Image Classification and Generation

Authors

Roni Paiss, Hila Chefer, Lior Wolf

Abstract

The application of zero-shot learning in computer vision has been revolutionized by the use of image-text matching models. The most notable example, CLIP, has been widely used for both zero-shot classification and guiding generative models with a text prompt. However, the zero-shot use of CLIP is unstable with respect to the phrasing of the input text, making it necessary to carefully engineer the prompts used. We find that this instability stems from a selective similarity score, which is based only on a subset of the semantically meaningful input tokens. To mitigate it, we present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input, in addition to employing the CLIP similarity loss used in previous works. When applied to one-shot classification through prompt engineering, our method yields an improvement in the recognition rate, without additional training or fine-tuning. Additionally, we show that CLIP guidance of generative models using our method significantly improves the generated images. Finally, we demonstrate a novel use of CLIP guidance for text-based image generation with spatial conditioning on object location, by requiring the image explainability heatmap for each object to be confined to a pre-determined bounding box.
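
The abstract describes two mechanisms: an explainability-based loss term that pushes CLIP to attend to every semantically meaningful token of the prompt, added on top of the usual CLIP similarity loss, and a spatial variant that confines each object's image heatmap to a bounding box. Below is a minimal, hypothetical sketch of the first idea only; the tensor names, shapes, and the way per-token relevance scores are obtained are assumptions for illustration (the paper's actual relevance computation follows transformer explainability methods and is not reproduced here).

```python
import torch

def explainability_aided_loss(image_emb, text_emb, relevance, semantic_mask, lam=1.0):
    """Hypothetical sketch of the combined loss described in the abstract.

    image_emb, text_emb: L2-normalized CLIP embeddings, shape (B, D).
    relevance: per-token explainability scores for the text prompt,
               shape (B, T), assumed normalized to [0, 1].
    semantic_mask: 1.0 for semantically meaningful tokens, 0.0 for
                   padding/punctuation, shape (B, T).
    lam: weight of the added explainability term (illustrative).
    """
    # Standard CLIP similarity loss used by prior guidance work:
    # maximize cosine similarity between image and text embeddings.
    sim_loss = 1.0 - (image_emb * text_emb).sum(dim=-1)

    # Added term: penalize low relevance on *any* meaningful token, so the
    # similarity score cannot be driven by only a subset of the prompt.
    expl_loss = ((1.0 - relevance) * semantic_mask).sum(dim=-1) \
        / semantic_mask.sum(dim=-1)

    return (sim_loss + lam * expl_loss).mean()
```

In a guidance setting, this loss would replace the plain similarity loss when optimizing the generated image (or when scoring prompts for classification); the spatial-conditioning variant would analogously penalize explainability heatmap mass falling outside each object's bounding box.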
