Paper Title

Knowledge Mining with Scene Text for Fine-Grained Recognition

Paper Authors

Hao Wang, Junchao Liao, Tianheng Cheng, Zewen Gao, Hao Liu, Bo Ren, Xiang Bai, Wenyu Liu

Paper Abstract

Recently, the semantics of scene text has been proven to be essential in fine-grained image classification. However, existing methods mainly exploit the literal meaning of scene text for fine-grained recognition, which may be irrelevant when it is not significantly related to the objects or scenes. We propose an end-to-end trainable network that mines the implicit contextual knowledge behind scene text images and enhances the semantics and correlation to fine-tune the image representation. Unlike existing methods, our model integrates three modalities: visual feature extraction, text semantics extraction, and correlating background knowledge with fine-grained image classification. Specifically, we employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification. Experiments on two benchmark datasets, Con-Text and Drink Bottle, show that our method outperforms the state of the art by 3.72% mAP and 5.39% mAP, respectively. To further validate the effectiveness of the proposed method, we create a new dataset on crowd activity recognition for evaluation. The source code and the new dataset of this work are available at https://github.com/lanfeng4659/KnowledgeMiningWithSceneText.
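
Based only on the abstract above, the sketch below illustrates the described fusion idea: visual features from an image backbone are concatenated with a knowledge-enhanced text embedding (which the paper obtains with KnowBert from the recognized scene text) and fed to a fine-grained classifier. All module names, feature dimensions, and the fusion-by-concatenation choice are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming a ResNet-50 visual backbone and a precomputed
# 768-d knowledge-enhanced text embedding (e.g. from KnowBert, not modeled here).
import torch
import torch.nn as nn
from torchvision.models import resnet50


class FusionClassifier(nn.Module):
    def __init__(self, num_classes: int, text_dim: int = 768):
        super().__init__()
        backbone = resnet50()            # randomly initialized; load pretrained weights in practice
        backbone.fc = nn.Identity()      # keep the 2048-d pooled visual feature
        self.backbone = backbone
        self.text_proj = nn.Linear(text_dim, 512)            # project text/knowledge embedding
        self.classifier = nn.Linear(2048 + 512, num_classes)  # fuse by concatenation

    def forward(self, images: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        v = self.backbone(images)                 # (B, 2048) visual features
        t = torch.relu(self.text_proj(text_emb))  # (B, 512) projected text semantics
        return self.classifier(torch.cat([v, t], dim=1))


# Usage with dummy tensors; in practice `text_emb` would come from a
# knowledge-enhanced language model applied to the recognized scene text.
model = FusionClassifier(num_classes=28)  # e.g. the 28 Con-Text categories
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 28])
```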
