Paper Title
Vision Skills Needed to Answer Visual Questions
Paper Authors
Paper Abstract
The task of answering questions about images has garnered attention as a practical service for assisting populations with visual impairments as well as a visual Turing test for the artificial intelligence community. Our first aim is to identify the common vision skills needed for both scenarios. To do so, we analyze the need for four vision skills---object recognition, text recognition, color recognition, and counting---on over 27,000 visual questions from two datasets representing both scenarios. We next quantify the difficulty of these skills for both humans and computers on both datasets. Finally, we propose a novel task of predicting what vision skills are needed to answer a question about an image. Our results reveal (mis)matches between aims of real users of such services and the focus of the AI community. We conclude with a discussion about future directions for addressing the visual question answering task.