Paper Title

NewsEdits: A News Article Revision Dataset and a Document-Level Reasoning Challenge

Authors

Alexander Spangher, Xiang Ren, Jonathan May, Nanyun Peng

Abstract

News article revision histories provide clues to narrative and factual evolution in news articles. To facilitate analysis of this evolution, we present the first publicly available dataset of news revision histories, NewsEdits. Our dataset is large-scale and multilingual; it contains 1.2 million articles with 4.6 million versions from over 22 English- and French-language newspaper sources based in three countries, spanning 15 years of coverage (2006-2021). We define article-level edit actions: Addition, Deletion, Edit and Refactor, and develop a high-accuracy extraction algorithm to identify these actions. To underscore the factual nature of many edit actions, we conduct analyses showing that added and deleted sentences are more likely to contain updating events, main content and quotes than unchanged sentences. Finally, to explore whether edit actions are predictable, we introduce three novel tasks aimed at predicting actions performed during version updates. We show that these tasks are possible for expert humans but are challenging for large NLP models. We hope this can spur research in narrative framing and help provide predictive tools for journalists chasing breaking news.
