Paper Title

Web Page Content Extraction Based on Multi-feature Fusion

Authors

Bowen Yu, Junping Du, Yingxia Shao

Abstract

With the rapid development of Internet technology, people have access to an ever wider variety of web page resources. At the same time, the rapid progress of deep learning depends heavily on large volumes of web data. NLP is likewise an important part of data processing technologies such as web page data extraction. Current web page text extraction techniques mainly rely on a single heuristic feature or strategy, and most of them require thresholds to be set manually. As the number and variety of web resources grow rapidly, a single strategy still runs into problems when extracting text from different kinds of pages. This paper proposes a web page text extraction algorithm based on multi-feature fusion. Based on the characteristics of text in web resources, DOM nodes are used as the extraction unit, multiple statistical features are designed for them, and high-order features are designed from heuristic strategies. The method builds a small neural network that takes the features of a DOM node as input and predicts whether the node contains text information. This makes full use of the different statistics and extraction strategies and adapts to more types of pages. Experimental results show that the method extracts web page text well and avoids the problem of manually determining thresholds.
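The abstract does not list the concrete features or the network architecture, so the following is only a minimal sketch of the general idea: compute a few statistics per DOM node and pass them to a small network that scores the node. The four features (text length, link density, text per tag, punctuation density), the choice of block-level tags, and all names such as NodeFeatureParser, node_features, and mlp_predict are assumptions made for illustration, not the authors' design.

```python
import math
from html.parser import HTMLParser

# Assumed for this sketch: which tags count as candidate extraction units.
BLOCK_TAGS = {"div", "p", "article", "section", "td", "li"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
SKIP_TAGS = {"script", "style"}
PUNCTUATION = set(",.;:!?，。；：！？")


class NodeFeatureParser(HTMLParser):
    """Accumulates raw counts for every block-level node in the page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open nodes, innermost last
        self.nodes = []   # closed block-level nodes, in closing order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        for node in self.stack:          # every ancestor gains one descendant tag
            node["tags"] += 1
        self.stack.append({"tag": tag, "text_len": 0, "link_len": 0,
                           "punct": 0, "tags": 0})

    def handle_endtag(self, tag):
        # Close the innermost matching node; unmatched end tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                closed = self.stack[i:]
                del self.stack[i:]
                self.nodes.extend(n for n in closed if n["tag"] in BLOCK_TAGS)
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or any(n["tag"] in SKIP_TAGS for n in self.stack):
            return
        in_link = any(n["tag"] == "a" for n in self.stack)
        punct = sum(ch in PUNCTUATION for ch in text)
        for node in self.stack:          # text counts toward all open ancestors
            node["text_len"] += len(text)
            node["punct"] += punct
            if in_link:
                node["link_len"] += len(text)


def node_features(node):
    """Turns raw counts into the (assumed) statistical feature vector."""
    n = node["text_len"]
    link_density = node["link_len"] / n if n else 0.0
    text_per_tag = n / (node["tags"] + 1)
    punct_density = node["punct"] / n if n else 0.0
    return [float(n), link_density, text_per_tag, punct_density]


def extract_features(html):
    """Returns (tag, feature_vector) for each closed block-level node."""
    parser = NodeFeatureParser()
    parser.feed(html)
    parser.close()
    return [(n["tag"], node_features(n)) for n in parser.nodes]


def mlp_predict(x, w1, b1, w2, b2):
    """Forward pass of a one-hidden-layer ReLU network with a sigmoid output.

    The weights must come from training on labeled nodes; none are given here.
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    if logit >= 0:                       # numerically stable sigmoid
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)
```

In such a setup, extract_features would be run on each page, the network would be trained on nodes labeled as text or non-text, and nodes whose score exceeds 0.5 would be kept. Because the decision boundary is learned from data, no per-feature threshold has to be chosen by hand, which is the advantage the abstract claims.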
