Hierarchical Multimodal Transformers for Multi-Page DocVQA

Paper title

Hierarchical multimodal transformers for Multi-Page DocVQA

Authors

Tito, Rubèn, Karatzas, Dimosthenis, Valveny, Ernest

Abstract

Document Visual Question Answering (DocVQA) refers to the task of answering questions from document images. Existing work on DocVQA considers only single-page documents. However, in real scenarios, documents are mostly composed of multiple pages that should be processed together. In this work, we extend DocVQA to the multi-page scenario. To that end, we first create a new dataset, MP-DocVQA, where questions are posed over multi-page documents instead of single pages. Second, we propose a new hierarchical method, Hi-VT5, based on the T5 architecture, that overcomes the limitations of current methods in processing long multi-page documents. The proposed method is based on a hierarchical transformer architecture in which the encoder summarizes the most relevant information of every page, and the decoder then takes this summarized information to generate the final answer. Through extensive experimentation, we demonstrate that our method is able, in a single stage, to answer the questions and to provide the page that contains the information relevant to the answer, which can be used as a kind of explainability measure.
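The data flow described in the abstract (summarize each page, then decode from the concatenated summaries) can be illustrated with a toy sketch. This is not the Hi-VT5 implementation: `summarize_page`, `encode_document`, the word-overlap score, and `summary_size` are hypothetical stand-ins chosen for illustration only. The sketch shows just two properties named in the abstract: the decoder input grows with the number of pages rather than with the total number of tokens, and a relevant page can be reported alongside the answer.

```python
from typing import List, Sequence, Tuple


def summarize_page(question: str, page: Sequence[str], summary_size: int) -> List[float]:
    """Toy stand-in for a page encoder (hypothetical, not the paper's model).

    Compresses one page into a fixed-size summary, conditioned on the question.
    """
    question_words = {w.strip("?.,").lower() for w in question.split()}
    overlap = sum(1 for w in page if w.strip("?.,").lower() in question_words)
    # First slot holds the question/page word overlap; the rest is zero padding.
    return [float(overlap)] + [0.0] * (summary_size - 1)


def encode_document(
    question: str, pages: Sequence[Sequence[str]], summary_size: int = 4
) -> Tuple[List[float], int]:
    """Summarize every page independently, then concatenate the summaries."""
    summaries = [summarize_page(question, page, summary_size) for page in pages]
    # A decoder would attend over this sequence; its length depends only on
    # the number of pages and summary_size, not on how long each page is.
    decoder_input = [value for summary in summaries for value in summary]
    # Toy page prediction: index of the page with the highest overlap score.
    predicted_page = max(range(len(pages)), key=lambda i: summaries[i][0])
    return decoder_input, predicted_page


pages = [
    "Invoice number 1042 issued to Acme".split(),
    "Total amount due is 300 dollars".split(),
    "Thank you for your business".split(),
]
decoder_input, predicted_page = encode_document("What is the total amount due?", pages)
print(len(decoder_input), predicted_page)
```

In a real model the fixed-size summary would be learned representations produced by a transformer encoder, and the decoder would generate answer text from them; here both are replaced by simple arithmetic to keep the example self-contained.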
