Paper title

PARM: A Paragraph Aggregation Retrieval Model for Dense Document-to-Document Retrieval

Authors

Sophia Althammer, Sebastian Hofstätter, Mete Sertkan, Suzan Verberne, Allan Hanbury

Abstract

Dense passage retrieval (DPR) models show great effectiveness gains in first-stage retrieval for the web domain. However, the web domain is a setting with large amounts of training data and a query-to-passage or query-to-document retrieval task. In this paper we investigate dense document-to-document retrieval with limited labelled target data for training, in particular legal case retrieval. In order to use DPR models for document-to-document retrieval, we propose a Paragraph Aggregation Retrieval Model (PARM), which liberates DPR models from their limited input length. PARM retrieves documents at the paragraph level: for each query paragraph, relevant documents are retrieved based on their paragraphs. The relevant results per query paragraph are then aggregated into one ranked list for the whole query document. For the aggregation we propose vector-based aggregation with reciprocal rank fusion (VRRF) weighting, which combines the advantages of rank-based aggregation and topical aggregation based on the dense embeddings. Experimental results show that VRRF outperforms rank-based aggregation strategies for dense document-to-document retrieval with PARM. We compare PARM to document-level retrieval and demonstrate higher retrieval effectiveness of PARM for lexical and dense first-stage retrieval on two different legal case retrieval collections. We investigate how to train the dense retrieval model for PARM on limited target data with labels at the paragraph level or the document level. In addition, we analyze the differences between the results retrieved by lexical and by dense retrieval with PARM.
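
The aggregation step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the data layout, and the smoothing constant `k = 60` (a common default for reciprocal rank fusion) are all assumptions made for illustration. `rrf_aggregate` shows plain rank-based aggregation of the per-query-paragraph result lists. `vrrf_aggregate` shows one plausible reading of VRRF, in which the query document is represented by the sum of its paragraph embeddings, each candidate document by the sum of its retrieved paragraph embeddings weighted by their reciprocal rank scores, and the two are compared with a dot product.

```python
from collections import defaultdict


def rrf_aggregate(runs, para_to_doc, k=60):
    """Rank-based aggregation (hypothetical helper).

    runs: one ranked list of retrieved paragraph ids per query paragraph.
    para_to_doc: maps a paragraph id to the id of its document.
    k: RRF smoothing constant (assumed value, not taken from the source).
    """
    scores = defaultdict(float)
    for ranked in runs:
        for rank, pid in enumerate(ranked, start=1):
            scores[para_to_doc[pid]] += 1.0 / (k + rank)
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def vrrf_aggregate(runs, para_to_doc, query_vecs, para_vecs, k=60):
    """Vector-based aggregation with RRF weights (assumed formulation).

    query_vecs: one embedding (list of floats) per query paragraph.
    para_vecs: maps a paragraph id to its embedding.
    """
    dim = len(query_vecs[0])
    # Query document embedding: sum of the query paragraph embeddings.
    q_sum = [sum(vec[i] for vec in query_vecs) for i in range(dim)]

    # Candidate document embedding: RRF-weighted sum of its retrieved paragraphs.
    doc_vecs = defaultdict(lambda: [0.0] * dim)
    for ranked in runs:
        for rank, pid in enumerate(ranked, start=1):
            weight = 1.0 / (k + rank)
            doc_vec = doc_vecs[para_to_doc[pid]]
            for i, value in enumerate(para_vecs[pid]):
                doc_vec[i] += weight * value

    # Score each candidate by the dot product with the query embedding.
    scores = {
        doc: sum(q * d for q, d in zip(q_sum, vec))
        for doc, vec in doc_vecs.items()
    }
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy data: a query document with two paragraphs, three candidate documents.
para_to_doc = {"a1": "A", "a2": "A", "b1": "B", "c1": "C"}
runs = [["a1", "b1"], ["a2", "c1"]]
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
para_vecs = {
    "a1": [1.0, 0.0],
    "a2": [0.0, 1.0],
    "b1": [1.0, 0.0],
    "c1": [0.0, 1.0],
}

rrf_ranking = rrf_aggregate(runs, para_to_doc)
vrrf_ranking = vrrf_aggregate(runs, para_to_doc, query_vecs, para_vecs)
```

In the toy data, document `A` is retrieved at rank 1 for both query paragraphs, so both functions place it first. On real data the two can disagree, because the vector-based variant also takes into account how well the retrieved paragraphs match the query document as a whole.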
