ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding

Paper title

ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding

Authors

Wenjin Wang, Zhengjie Huang, Bin Luo, Qianglong Chen, Qiming Peng, Yinxu Pan, Weichong Yin, Shikun Feng, Yu Sun, Dianhai Yu, Yin Zhang

Abstract

Recent efforts of multimodal Transformers have improved Visually Rich Document Understanding (VrDU) tasks via incorporating visual and textual information. However, existing approaches mainly focus on fine-grained elements such as words and document image patches, making it hard for them to learn from coarse-grained elements, including natural lexical units like phrases and salient visual regions like prominent image regions. In this paper, we attach more importance to coarse-grained elements containing high-density information and consistent semantics, which are valuable for document understanding. At first, a document graph is proposed to model complex relationships among multi-grained multimodal elements, in which salient visual regions are detected by a cluster-based method. Then, a multi-grained multimodal Transformer called mmLayout is proposed to incorporate coarse-grained information into existing pre-trained fine-grained multimodal Transformers based on the graph. In mmLayout, coarse-grained information is aggregated from fine-grained, and then, after further processing, is fused back into fine-grained for final prediction. Furthermore, common sense enhancement is introduced to exploit the semantic information of natural lexical units. Experimental results on four tasks, including information extraction and document question answering, show that our method can improve the performance of multimodal Transformers based on fine-grained elements and achieve better performance with fewer parameters. Qualitative analyses show that our method can capture consistent semantics in coarse-grained elements.
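
The aggregate-then-fuse flow described in the abstract can be illustrated with a toy sketch. The abstract does not say how aggregation or fusion is computed, so the mean pooling and additive fusion below are assumptions chosen for simplicity, and the names `aggregate`, `fuse_back`, and `phrase_of` are hypothetical. The "further processing" of coarse-grained vectors mentioned in the abstract is omitted here.

```python
from typing import Dict, List

Vector = List[float]


def aggregate(fine: List[Vector], group_of: List[int]) -> Dict[int, Vector]:
    """Mean-pool fine-grained vectors into one vector per coarse group.

    Assumption: mean pooling stands in for whatever aggregation the paper uses.
    """
    if len(fine) != len(group_of):
        raise ValueError("each fine-grained vector needs exactly one group id")
    sums: Dict[int, Vector] = {}
    counts: Dict[int, int] = {}
    for vec, gid in zip(fine, group_of):
        if gid not in sums:
            sums[gid] = [0.0] * len(vec)
            counts[gid] = 0
        sums[gid] = [s + v for s, v in zip(sums[gid], vec)]
        counts[gid] += 1
    return {gid: [s / counts[gid] for s in total] for gid, total in sums.items()}


def fuse_back(
    fine: List[Vector], group_of: List[int], coarse: Dict[int, Vector]
) -> List[Vector]:
    """Add each group's coarse vector to its member fine-grained vectors.

    Assumption: additive (residual-style) fusion, not the paper's actual operator.
    """
    return [
        [v + c for v, c in zip(vec, coarse[gid])]
        for vec, gid in zip(fine, group_of)
    ]


# Toy input: three word vectors; the first two words form one phrase (group 0).
words = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
phrase_of = [0, 0, 1]

phrases = aggregate(words, phrase_of)        # one vector per phrase
fused = fuse_back(words, phrase_of, phrases)  # word vectors enriched with phrase context
print(phrases)
print(fused)
```

In this sketch, `phrase_of` plays the role of the document graph's edges between fine-grained and coarse-grained elements: it records which phrase each word belongs to.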
