Paper Title

Captioning Images Taken by People Who Are Blind

Paper Authors

Danna Gurari, Yinan Zhao, Meng Zhang, Nilavra Bhattacharya

Paper Abstract

While an important problem in the vision community is to design algorithms that can automatically caption images, few publicly available datasets for algorithm development directly address the interests of real users. Observing that people who are blind have relied on (human-based) image captioning services to learn about images they take for nearly a decade, we introduce the first image captioning dataset to represent this real use case. This new dataset, which we call VizWiz-Captions, consists of over 39,000 images originating from people who are blind, each paired with five captions. We analyze this dataset to (1) characterize the typical captions, (2) characterize the diversity of content found in the images, and (3) compare its content to that found in eight popular vision datasets. We also analyze modern image captioning algorithms to identify what makes this new dataset challenging for the vision community. We publicly share the dataset with captioning challenge instructions at https://vizwiz.org.
