Paper Title

Knowledge Mining with Scene Text for Fine-Grained Recognition

Paper Authors

Hao Wang, Junchao Liao, Tianheng Cheng, Zewen Gao, Hao Liu, Bo Ren, Xiang Bai, Wenyu Liu

Paper Abstract

Recently, the semantics of scene text has been proven to be essential in fine-grained image classification. However, existing methods mainly exploit the literal meaning of scene text for fine-grained recognition, which may be irrelevant when it is not significantly related to the objects or scenes. We propose an end-to-end trainable network that mines the implicit contextual knowledge behind scene text images and enhances the semantics and correlation to fine-tune the image representation. Unlike existing methods, our model integrates three modalities: visual feature extraction, text semantics extraction, and correlating background knowledge with fine-grained image classification. Specifically, we employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification. Experiments on two benchmark datasets, Con-Text and Drink Bottle, show that our method outperforms the state of the art by 3.72% mAP and 5.39% mAP, respectively. To further validate the effectiveness of the proposed method, we create a new dataset on crowd activity recognition for evaluation. The source code and the new dataset of this work are available at https://github.com/lanfeng4659/KnowledgeMiningWithSceneText.
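
Based only on the abstract above, the sketch below illustrates the described fusion idea: visual features from an image backbone are concatenated with a knowledge-enhanced text embedding (which the paper obtains with KnowBert from the recognized scene text) and fed to a fine-grained classifier. All module names, feature dimensions, and the fusion-by-concatenation choice are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch, assuming a ResNet-50 visual backbone and a precomputed
# 768-d knowledge-enhanced text embedding (e.g. from KnowBert, not modeled here).
import torch
import torch.nn as nn
from torchvision.models import resnet50


class FusionClassifier(nn.Module):
    def __init__(self, num_classes: int, text_dim: int = 768):
        super().__init__()
        backbone = resnet50()            # randomly initialized; load pretrained weights in practice
        backbone.fc = nn.Identity()      # keep the 2048-d pooled visual feature
        self.backbone = backbone
        self.text_proj = nn.Linear(text_dim, 512)            # project text/knowledge embedding
        self.classifier = nn.Linear(2048 + 512, num_classes)  # fuse by concatenation

    def forward(self, images: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        v = self.backbone(images)                 # (B, 2048) visual features
        t = torch.relu(self.text_proj(text_emb))  # (B, 512) projected text semantics
        return self.classifier(torch.cat([v, t], dim=1))


# Usage with dummy tensors; in practice `text_emb` would come from a
# knowledge-enhanced language model applied to the recognized scene text.
model = FusionClassifier(num_classes=28)  # e.g. the 28 Con-Text categories
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 28])
```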
