Visual Question Answering as a Multi-Task Problem

Paper title

Visual Question Answering as a Multi-Task Problem

Authors

Pollard, Amelia Elizabeth, Shapiro, Jonathan L.

Abstract

Visual Question Answering (VQA) is a highly complex problem set, relying on many sub-problems to produce reasonable answers. In this paper, we present the hypothesis that Visual Question Answering should be viewed as a multi-task problem, and provide evidence to support this hypothesis. We demonstrate this by reformatting two commonly used Visual Question Answering datasets, COCO-QA and DAQUAR, into a multi-task format and training two baseline networks on the reformatted datasets, one of which is designed specifically to eliminate other possible causes of performance changes resulting from the reformatting. Though the networks demonstrated in this paper do not achieve strongly competitive results, we find that the multi-task approach to Visual Question Answering increases performance by 5-9% over the single-task formatting, and that the networks reach convergence much faster than in the single-task case. Finally, we discuss possible reasons for the observed difference in performance, and perform additional experiments which rule out causes not associated with learning the dataset as a multi-task problem.
