Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements

Paper Title

Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements

Authors

Li, Yang; Li, Gang; He, Luheng; Zheng, Jingjie; Li, Hong; Guan, Zhiwei

Abstract

Natural language descriptions of user interface (UI) elements such as alternative text are crucial for accessibility and language-based interaction in general. Yet, these descriptions are constantly missing in mobile UIs. We propose widget captioning, a novel task for automatically generating language descriptions for UI elements from multimodal input including both the image and the structural representations of user interfaces. We collected a large-scale dataset for widget captioning with crowdsourcing. Our dataset contains 162,859 language phrases created by human workers for annotating 61,285 UI elements across 21,750 unique UI screens. We thoroughly analyze the dataset, and train and evaluate a set of deep model configurations to investigate how each feature modality as well as the choice of learning strategies impact the quality of predicted captions. The task formulation and the dataset as well as our benchmark models contribute a solid basis for this novel multimodal captioning task that connects language and user interfaces.
