Fine-grained Czech News Article Dataset: An Interdisciplinary Approach to Trustworthiness Analysis

Title

Fine-grained Czech News Article Dataset: An Interdisciplinary Approach to Trustworthiness Analysis

Authors

Matyáš Boháček, Michal Bravanský, Filip Trhlík, Václav Moravec

Abstract

We present the Verifee Dataset: a novel dataset of news articles with fine-grained trustworthiness annotations. We develop a detailed methodology that assesses texts on parameters encompassing editorial transparency, journalistic conventions, and objective reporting, while penalizing manipulative techniques. We bring together a diverse set of researchers from the social, media, and computer sciences to overcome the barriers and limited framing of this interdisciplinary problem. We collect over 10,000 unique articles from almost 60 Czech online news sources. Each is categorized into one of the 4 classes across the credibility spectrum we propose, ranging from entirely trustworthy articles to manipulative ones. We produce detailed statistics and study trends emerging throughout the set. Lastly, we fine-tune multiple popular sequence-to-sequence language models on the trustworthiness classification task using our dataset and report a best test F1 score of 0.52. We open-source the dataset, annotation methodology, and annotators' instructions in full at https://verifee.ai/research to enable follow-up work. We believe similar methods can help prevent disinformation and support education in media literacy.
