Paper title
Reproducing Personalised Session Search over the AOL Query Log
Paper authors
Paper abstract
Despite its troubled past, the AOL Query Log continues to be an important resource to the research community, particularly for tasks like search personalisation. When the query log is used for these ranking experiments, little attention is usually paid to the document corpus. Recent work typically uses a corpus containing versions of the documents collected long after the log was produced. Given that web documents are prone to change over time, we study the differences between a version of the corpus containing documents as they appeared in 2017 (which has been used by several recent works) and a new version we construct that includes documents close to how they appeared at the time the query log was produced (2006). We demonstrate that this new version of the corpus has a far higher coverage of documents present in the original log (93%) than the 2017 version (55%). Among the overlapping documents, the content often differs substantially. Given these differences, we re-conduct session search experiments that originally used the 2017 corpus and find that when using our corpus for training or evaluation, system performance improves. We place the results in context by introducing recent adhoc ranking baselines. We also confirm the navigational nature of the queries in the AOL corpus by showing that including the URL substantially improves performance across a variety of models. Our version of the corpus can be easily reconstructed by other researchers and is included in the ir-datasets package.