Title

Boilerplate Detection via Semantic Classification of TextBlocks

Authors

Zhang, Hao; Wang, Jie

Abstract

We present SemText, a hierarchical neural network model that detects HTML boilerplate based on a novel semantic representation of HTML tags, class names, and text blocks. We train SemText on three published datasets of news webpages and fine-tune it using a small amount of development data from CleanEval and GoogleTrends-2017. We show that SemText achieves state-of-the-art accuracy on these datasets. We then demonstrate the robustness of SemText by showing that it also detects boilerplate effectively on out-of-domain community-based question-answer webpages.
