Paper Title
WebFormer: The Web-page Transformer for Structure Information Extraction
Paper Authors
Abstract
Structure information extraction is the task of extracting structured text fields from web pages, for example extracting a product offer (title, description, brand, and price) from a shopping page. It is an important research topic that has been widely studied in document understanding and web search. Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction. However, effectively serializing tokens from unstructured web pages is challenging in practice because web layout patterns vary widely, and only limited work has focused on modeling the web layout when extracting text fields. In this paper, we introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents. First, we design HTML tokens for each DOM node in the HTML by embedding representations from their neighboring tokens through graph attention. Second, we construct rich attention patterns between HTML tokens and text tokens, which leverage the web layout for effective attention weight computation. We conduct an extensive set of experiments on the SWDE and Common Crawl benchmarks. Experimental results demonstrate the superior performance of the proposed approach over several state-of-the-art methods.
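To make the two ideas in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes one HTML token per DOM element and one text token per whitespace-separated word, and it builds a boolean mask in which HTML tokens attend to their DOM neighbors (parent and children) and text tokens attend to their enclosing element and to text in the same element. The names `DomTokenizer`, `dom_edges`, and `attention_mask`, and the exact attention pattern, are hypothetical choices made for illustration; the abstract does not specify them.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed subset, for illustration).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class DomTokenizer(HTMLParser):
    """Collects one HTML token per element and one text token per word."""

    def __init__(self):
        super().__init__()
        self.html_tokens = []  # (tag, index of parent HTML token or None)
        self.text_tokens = []  # (word, index of enclosing HTML token)
        self._stack = []       # indices of currently open elements

    def handle_starttag(self, tag, attrs):
        parent = self._stack[-1] if self._stack else None
        self.html_tokens.append((tag, parent))
        if tag not in VOID_TAGS:
            self._stack.append(len(self.html_tokens) - 1)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open element.
        if self._stack and self.html_tokens[self._stack[-1]][0] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack:
            return  # ignore text outside any element
        for word in data.split():
            self.text_tokens.append((word, self._stack[-1]))


def dom_edges(html_tokens):
    """Returns symmetric parent-child edges between HTML token indices."""
    edges = set()
    for child, (_, parent) in enumerate(html_tokens):
        if parent is not None:
            edges.add((parent, child))
            edges.add((child, parent))
    return edges


def attention_mask(html_tokens, text_tokens):
    """mask[i][j] is True when token i may attend to token j.

    Sequence layout: all HTML tokens first, then all text tokens.
    """
    n_html = len(html_tokens)
    n = n_html + len(text_tokens)
    mask = [[i == j for j in range(n)] for i in range(n)]
    # HTML-to-HTML: graph neighbors in the DOM tree.
    for i, j in dom_edges(html_tokens):
        mask[i][j] = True
    for a, (_, owner_a) in enumerate(text_tokens):
        # Text-to-HTML and HTML-to-text: a word and its enclosing element.
        mask[n_html + a][owner_a] = True
        mask[owner_a][n_html + a] = True
        # Text-to-text: words that share the same enclosing element.
        for b, (_, owner_b) in enumerate(text_tokens):
            if owner_a == owner_b:
                mask[n_html + a][n_html + b] = True
    return mask


parser = DomTokenizer()
parser.feed("<div><h1>Acme Phone</h1><span>$199</span></div>")
mask = attention_mask(parser.html_tokens, parser.text_tokens)
```

In a real model, a mask like this would restrict which pairs contribute to the attention weights, so the page layout shapes the computation instead of a flat token sequence alone.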