Paper Title

Weakly Supervised Grounding for VQA in Vision-Language Transformers

Paper Authors

Khan, Aisha Urooj, Kuehne, Hilde, Gan, Chuang, Lobo, Niels Da Vitoria, Shah, Mubarak

Abstract

Transformers for vision-language representation learning have attracted significant interest and have shown strong performance on visual question answering (VQA) and grounding. However, most systems that perform well on these tasks still rely on pre-trained object detectors during training, which limits their applicability to the object classes available to those detectors. To mitigate this limitation, this paper focuses on the problem of weakly supervised grounding in the context of visual question answering in transformers. The approach leverages capsules by grouping visual tokens in the visual encoder, and uses activations from the language self-attention layers as a text-guided selection module to mask those capsules before they are forwarded to the next layer. We evaluate our approach on the challenging GQA as well as the VQA-HAT dataset for VQA grounding. Our experiments show that, while removing the information of masked objects from standard transformer architectures leads to a significant drop in performance, the integration of capsules significantly improves the grounding ability of such systems and provides new state-of-the-art results compared to other approaches in the field.
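The core mechanism in the abstract, masking visual capsules based on text-derived relevance before the next layer, can be illustrated with a minimal sketch. This is not the authors' implementation: the pooling of language self-attention into per-capsule scores, the top-k selection, and the function name `text_guided_capsule_mask` are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_capsule_mask(capsules, text_scores, keep_k=2):
    """Hypothetical sketch of text-guided capsule masking.

    capsules:    (num_capsules, dim) visual capsule activations
    text_scores: (num_capsules,) per-capsule relevance scores, assumed
                 to be pooled from language self-attention activations
    keep_k:      number of most text-relevant capsules to keep active
    """
    relevance = softmax(text_scores)
    # Zero out all but the k most text-relevant capsules
    keep = np.argsort(relevance)[-keep_k:]
    mask = np.zeros(capsules.shape[0])
    mask[keep] = 1.0
    return capsules * mask[:, None], mask

# Toy example: 4 capsules of dimension 3
caps = np.ones((4, 3))
scores = np.array([0.1, 2.0, 0.5, 1.5])
masked, mask = text_guided_capsule_mask(caps, scores, keep_k=2)
# Capsules 1 and 3 (highest scores) survive; 0 and 2 are masked out
```

In the paper's pipeline the surviving capsules would be forwarded to the next transformer layer, so only question-relevant visual information propagates.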
