Paper Title

ORCAS: 18 Million Clicked Query-Document Pairs for Analyzing Search

Authors

Nick Craswell, Daniel Campos, Bhaskar Mitra, Emine Yilmaz, Bodo Billerbeck

Abstract

Users of Web search engines reveal their information needs through queries and clicks, making click logs a useful asset for information retrieval. However, click logs have not been publicly released for academic use, because they can be too revealing of personally or commercially sensitive information. This paper describes a click data release related to the TREC Deep Learning Track document corpus. After aggregation and filtering, including a k-anonymity requirement, we find 1.4 million of the TREC DL URLs have 18 million connections to 10 million distinct queries. Our dataset of these queries and connections to TREC documents is of similar size to proprietary datasets used in previous papers on query mining and ranking. We perform some preliminary experiments using the click data to augment the TREC DL training data, offering by comparison: 28x more queries, with 49x more connections to 4.4x more URLs in the corpus. We present a description of the dataset's generation process, characteristics, use in ranking and suggest other potential uses.
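
To make the "aggregation and filtering, including a k-anonymity requirement" step concrete, here is a minimal sketch of one way such a filter could work. It is an illustration only, not the authors' pipeline: the log schema `(user_id, query, url)`, the function name `k_anonymous_pairs`, and the threshold `k` are all assumptions, and the abstract does not state the value of k or the exact filtering rules used for ORCAS.

```python
from collections import defaultdict


def k_anonymous_pairs(click_log, k):
    """Aggregate a click log into distinct (query, url) pairs.

    click_log: iterable of (user_id, query, url) tuples (hypothetical schema).
    k: minimum number of distinct users who must have issued a query
       for that query to be released (assumed interpretation of k-anonymity).
    """
    users_per_query = defaultdict(set)
    pairs = set()
    for user_id, query, url in click_log:
        users_per_query[query].add(user_id)
        pairs.add((query, url))  # aggregation: keep each pair once, drop user ids
    # Filtering: release only pairs whose query was issued by at least k users.
    return sorted((q, u) for q, u in pairs if len(users_per_query[q]) >= k)


# Toy data with made-up users, queries, and URLs.
toy_log = [
    ("u1", "orcas dataset", "https://example.com/a"),
    ("u2", "orcas dataset", "https://example.com/a"),
    ("u2", "orcas dataset", "https://example.com/b"),
    ("u3", "rare query", "https://example.com/c"),
]
released = k_anonymous_pairs(toy_log, k=2)
```

With `k=2`, the query issued by only one user is dropped, while both URLs clicked for the query shared by two users are kept. The released pairs carry no user identifiers, which is the point of aggregating before release.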
