Paper Title
OSIC: A New One-Stage Image Captioner Coined
Paper Authors
Paper Abstract
Mainstream image captioning models are usually two-stage captioners, i.e., they compute object features with a pre-trained detector and feed them into a language model to generate text descriptions. However, this design introduces a task-based information gap that degrades performance, since object features learned for the detection task are a suboptimal representation and cannot provide all the information needed for subsequent text generation. Besides, object features are usually taken from the detector's last layer, which loses the local details of the input image. In this paper, we propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning, which directly transforms an input image into descriptive sentences in a single stage. As a result, the task-based information gap can be greatly reduced. To obtain rich features, we use the Swin Transformer to compute multi-level features and feed them into a novel dynamic multi-sight embedding module that exploits both the global structure and the local texture of the input image. To enhance the caption encoder's global modeling, we propose a new dual-dimensional refining module that non-locally models interactions among the embedded features. Finally, OSIC obtains rich and useful information that improves the image captioning task. Extensive comparisons on the benchmark MS-COCO dataset verify the superior performance of our method.
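
The abstract outlines a one-stage pipeline: multi-level Swin Transformer features are fused by a dynamic multi-sight embedding, refined by a dual-dimensional module, and then decoded into a caption. The PyTorch sketch below illustrates that flow under stated assumptions only: the class names, feature shapes, softmax gating, and the squeeze-excitation-style channel interaction are all illustrative choices, not the authors' released implementation.

import torch
import torch.nn as nn

class DynamicMultiSightEmbedding(nn.Module):
    # Fuses multi-level backbone features with an input-dependent gate so the
    # captioner can weigh global structure against local texture. This is a
    # hypothetical realization of the paper's "dynamic multi-sight" idea.
    def __init__(self, dims, embed_dim):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, embed_dim) for d in dims])
        self.gate = nn.Linear(embed_dim, 1)

    def forward(self, feats):
        # feats: list of per-level token maps, each of shape (B, N_i, C_i)
        tokens = [p(f) for p, f in zip(self.proj, feats)]          # align channels
        pooled = torch.stack([t.mean(dim=1) for t in tokens], 1)   # (B, L, D)
        weights = torch.softmax(self.gate(pooled), dim=1)          # (B, L, 1)
        fused = [w.unsqueeze(1) * t for w, t in zip(weights.unbind(1), tokens)]
        return torch.cat(fused, dim=1)                             # (B, sum N_i, D)

class DualDimensionalRefining(nn.Module):
    # Models non-local interaction along the token dimension (self-attention)
    # and the channel dimension (a squeeze-excitation-style gate); one plausible
    # reading of "dual-dimensional refining", not the authors' exact design.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.token_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.chan_gate = nn.Sequential(
            nn.Linear(dim, dim // 4), nn.ReLU(),
            nn.Linear(dim // 4, dim), nn.Sigmoid())

    def forward(self, x):                                          # x: (B, N, D)
        h = self.norm(x)
        x = x + self.token_attn(h, h, h, need_weights=False)[0]    # token-wise
        g = self.chan_gate(x.mean(dim=1, keepdim=True))            # (B, 1, D)
        return x + x * g                                           # channel-wise

# Toy usage with stub multi-level features shaped like a Swin backbone's output.
B = 2
feats = [torch.randn(B, 49, 96), torch.randn(B, 16, 384), torch.randn(B, 4, 768)]
embed = DynamicMultiSightEmbedding(dims=[96, 384, 768], embed_dim=512)
refine = DualDimensionalRefining(dim=512)
visual_tokens = refine(embed(feats))   # would be fed to a caption decoder
print(visual_tokens.shape)             # torch.Size([2, 69, 512])

The per-level softmax gate makes the fusion weights depend on the input image, which is one plausible reading of "dynamic" in the multi-sight embedding; a caption decoder consuming visual_tokens would complete the one-stage captioner.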