Paper Title

Images Speak in Images: A Generalist Painter for In-Context Visual Learning

Paper Authors

Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, Tiejun Huang

Paper Abstract

In-context learning, as a new paradigm in NLP, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. But in computer vision, the difficulties for in-context learning lie in that tasks vary significantly in the output representations, thus it is unclear how to define the general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks. In this work, we present Painter, a generalist model which addresses these obstacles with an "image"-centric solution, that is, to redefine the output of core vision tasks as images, and specify task prompts as also images. With this idea, our training process is extremely simple, which performs standard masked image modeling on the stitch of input and output image pairs. This makes the model capable of performing tasks conditioned on visible image patches. Thus, during inference, we can adopt a pair of input and output images from the same task as the input condition, to indicate which task to perform. Without bells and whistles, our generalist Painter can achieve competitive performance compared to well-established task-specific models, on seven representative vision tasks ranging from high-level visual understanding to low-level image processing. In addition, Painter significantly outperforms recent generalist models on several challenging tasks.
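
The abstract's core idea of stitching an input–output example together with a query image and letting masked image modeling fill in the missing output can be sketched as follows. This is a minimal illustration of the canvas layout only, not the authors' implementation; the `stitch_canvas` helper, the 224-pixel image size, and the 16-pixel patch size are assumptions for the example.

```python
import numpy as np

def stitch_canvas(prompt_in, prompt_out, query_in, patch=16):
    """Stitch an in-context example and a query into one 2x2 canvas.

    Layout (as described in the abstract):
        [ prompt input | prompt output ]
        [ query input  | masked region ]
    The bottom-right quadrant is the region the model must paint.
    """
    h, w, c = query_in.shape
    canvas = np.zeros((2 * h, 2 * w, c), dtype=query_in.dtype)
    canvas[:h, :w] = prompt_in    # task example: input image
    canvas[:h, w:] = prompt_out   # task example: output rendered as an image
    canvas[h:, :w] = query_in     # new input we want a prediction for
    # Bottom-right stays zero; its patches are the masked targets.
    mask = np.zeros((2 * h // patch, 2 * w // patch), dtype=bool)
    mask[h // patch:, w // patch:] = True
    return canvas, mask

# Hypothetical usage with random arrays standing in for real images.
prompt_in = np.random.rand(224, 224, 3).astype(np.float32)
prompt_out = np.random.rand(224, 224, 3).astype(np.float32)  # e.g. a depth or segmentation map rendered as RGB
query_in = np.random.rand(224, 224, 3).astype(np.float32)

canvas, mask = stitch_canvas(prompt_in, prompt_out, query_in)
print(canvas.shape, int(mask.sum()), "masked patches")  # (448, 448, 3) 196 masked patches
```

In this framing, training reduces to standard masked image modeling over such stitched canvases, and at inference time swapping in a different prompt pair is what tells the model which task to perform.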
