Paper Title

Visual Semantic Parsing: From Images to Abstract Meaning Representation

Authors

Mohamed Ashraf Abdelsalam, Zhan Shi, Federico Fancellu, Kalliopi Basioti, Dhaivat J. Bhatt, Vladimir Pavlovic, Afsaneh Fazly

Abstract

The success of scene graphs for visual scene understanding has brought attention to the benefits of abstracting a visual input (e.g., image) into a structured representation, where entities (people and objects) are nodes connected by edges specifying their relations. Building these representations, however, requires expensive manual annotation in the form of images paired with their scene graphs or frames. These formalisms remain limited in the nature of entities and relations they can capture. In this paper, we propose to leverage a widely-used meaning representation in the field of natural language processing, the Abstract Meaning Representation (AMR), to address these shortcomings. Compared to scene graphs, which largely emphasize spatial relationships, our visual AMR graphs are more linguistically informed, with a focus on higher-level semantic concepts extrapolated from visual input. Moreover, they allow us to generate meta-AMR graphs to unify information contained in multiple image descriptions under one representation. Through extensive experimentation and analysis, we demonstrate that we can re-purpose an existing text-to-AMR parser to parse images into AMRs. Our findings point to important future research directions for improved scene understanding.
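To make the target formalism concrete, below is a standard AMR graph in Penman notation for the simple caption "The boy wants to eat a pizza." This is a generic textbook-style illustration of AMR, not an example taken from the paper's data; note how the graph captures predicate-argument structure (`want-01`, `eat-01` with `:ARG0`/`:ARG1` roles, and variable reuse for the shared agent `b`) rather than spatial relations.

```
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (e / eat-01
      :ARG0 b
      :ARG1 (p / pizza)))
```

The reentrant variable `b` marks the boy as the agent of both wanting and eating, a kind of higher-level semantic abstraction that scene graphs, with their emphasis on pairwise spatial relations between detected objects, typically do not encode.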