Paper Title
TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation
Paper Authors
Paper Abstract
Text-VQA aims at answering questions that require understanding the textual cues in an image. Despite the great progress of existing Text-VQA methods, their performance suffers from insufficient human-labeled question-answer (QA) pairs. However, we observe that the scene text is generally not fully exploited in the existing datasets: only a small portion of the text in each image participates in the annotated QA activities, which results in a huge waste of useful information. To address this deficiency, we develop a new method to generate high-quality and diverse QA pairs by explicitly utilizing the rich text already available in the scene context of each image. Specifically, we propose TAG, a text-aware visual question-answer generation architecture that learns to produce meaningful and accurate QA samples using a multimodal transformer. The architecture exploits underexplored scene text information and enhances the scene understanding of Text-VQA models by combining the generated QA pairs with the initial training data. Extensive experimental results on two well-known Text-VQA benchmarks (TextVQA and ST-VQA) demonstrate that our proposed TAG effectively enlarges the training data, which helps improve Text-VQA performance without extra labeling effort. Moreover, our model outperforms state-of-the-art approaches that are pre-trained with extra large-scale data. Code is available at https://github.com/HenryJunW/TAG.
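As a rough illustration of the augmentation workflow the abstract describes (generate QA pairs from under-used scene text, then merge them with the initial training set), below is a minimal Python sketch. All names in it (Sample, TextAwareQAGenerator, augment) are hypothetical placeholders, and the generator body is a trivial stand-in for the actual multimodal transformer, not the authors' implementation.

```python
# Minimal sketch of a TAG-style augmentation loop; names and logic are
# illustrative placeholders, not the authors' actual API or model.
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    image_id: str
    ocr_tokens: List[str]  # scene text detected in the image
    question: str
    answer: str


class TextAwareQAGenerator:
    """Stand-in for the multimodal transformer that generates QA pairs
    conditioned on scene text. A real model would consume visual features
    and OCR embeddings; this placeholder only mimics the control flow."""

    def generate(self, image_id: str, ocr_tokens: List[str]) -> List[Sample]:
        # Pair a template question with each unused OCR token so the
        # overall pipeline is clear; a real decoder would produce
        # diverse, text-aware questions and answers.
        return [
            Sample(image_id, ocr_tokens, f"What does the text '{tok}' say?", tok)
            for tok in ocr_tokens
        ]


def augment(train_set: List[Sample], generator: TextAwareQAGenerator) -> List[Sample]:
    """Combine generated QA pairs with the initial training data,
    with no extra human labeling."""
    generated: List[Sample] = []
    for s in train_set:
        # Scene text not covered by the annotated answer is "under-exploited".
        unused = [t for t in s.ocr_tokens if t.lower() != s.answer.lower()]
        generated.extend(generator.generate(s.image_id, unused))
    return train_set + generated


if __name__ == "__main__":
    seed = [Sample("img_001", ["STOP", "Main St"], "What does the sign say?", "STOP")]
    enlarged = augment(seed, TextAwareQAGenerator())
    print(f"{len(seed)} original -> {len(enlarged)} samples after augmentation")
```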