Paper Title

SeDR: Segment Representation Learning for Long Documents Dense Retrieval

Paper Authors

Junying Chen, Qingcai Chen, Dongfang Li, Yutao Huang

Paper Abstract

Recently, Dense Retrieval (DR) has become a promising solution to document retrieval, where document representations are used to perform effective and efficient semantic search. However, DR remains challenging on long documents, due to the quadratic complexity of its Transformer-based encoder and the finite capacity of a low-dimensional embedding. Current DR models use suboptimal strategies such as truncating or splitting-and-pooling long documents, leading to poor utilization of whole-document information. In this work, to tackle this problem, we propose Segment representation learning for long documents Dense Retrieval (SeDR). In SeDR, a Segment-Interaction Transformer is proposed to encode long documents into document-aware and segment-sensitive representations, while retaining the complexity of splitting-and-pooling and outperforming other segment-interaction patterns on DR. Since the GPU memory requirements of long-document encoding cause insufficient negatives for DR training, Late-Cache Negative is further proposed to provide additional cache negatives for optimizing representation learning. Experiments on the MS MARCO and TREC-DL datasets show that SeDR achieves superior performance among DR models, confirming the effectiveness of SeDR for long-document retrieval.
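
To make the segment-interaction idea concrete, below is a minimal PyTorch sketch, not the authors' implementation: each segment is encoded independently (keeping the per-segment quadratic cost of splitting-and-pooling), and the per-segment representation vectors then attend to one another so each segment embedding becomes document-aware. The class name `SegmentInteractionEncoder`, the layer sizes, and the single cross-segment attention layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SegmentInteractionEncoder(nn.Module):
    """Hypothetical sketch of segment-interaction encoding (not the paper's code).

    Each segment is encoded independently, as in splitting-and-pooling, then the
    per-segment representation vectors attend to one another so every segment
    embedding becomes document-aware yet stays segment-sensitive.
    """

    def __init__(self, hidden=768, n_heads=12, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(hidden, n_heads, batch_first=True)
        self.segment_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Attention over segment vectors only: the cross-segment cost is
        # quadratic in the (small) number of segments, not in document length.
        self.interaction = nn.MultiheadAttention(hidden, n_heads, batch_first=True)

    def forward(self, segment_embeddings):
        # segment_embeddings: (n_segments, seg_len, hidden) token embeddings
        # of one document that has already been split into segments.
        encoded = self.segment_encoder(segment_embeddings)
        seg_vecs = encoded[:, 0, :].unsqueeze(0)   # (1, n_segments, hidden)
        # Segments exchange information through their representation vectors.
        doc_aware, _ = self.interaction(seg_vecs, seg_vecs, seg_vecs)
        return doc_aware.squeeze(0)                # (n_segments, hidden)

segments = torch.randn(4, 128, 768)               # e.g. 4 segments of 128 tokens
reps = SegmentInteractionEncoder()(segments)      # (4, 768)
```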
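Late-Cache Negative is described only at a high level in the abstract; one common way to realize "additional cache negatives" is a FIFO cache of detached document embeddings from recent training steps, scored against the current queries as extra negatives. The sketch below follows that assumption (names like `NegativeCache` and `contrastive_loss` are hypothetical), and the paper's exact caching schedule may differ.

```python
import torch
import torch.nn.functional as F
from collections import deque

class NegativeCache:
    """Hypothetical FIFO cache of detached document embeddings."""

    def __init__(self, max_size=4096):
        self.queue = deque(maxlen=max_size)

    def update(self, doc_embs):
        # Detach so cached vectors act as constant negatives (no gradients).
        for emb in doc_embs.detach():
            self.queue.append(emb)

    def tensor(self):
        return torch.stack(list(self.queue)) if self.queue else None

def contrastive_loss(queries, docs, cache, temp=1.0):
    # queries, docs: (B, H); docs[i] is the positive for queries[i],
    # the other in-batch documents serve as negatives.
    scores = queries @ docs.t()
    cached = cache.tensor()
    if cached is not None:
        # Cached embeddings from earlier steps enlarge the negative pool
        # without re-encoding any long document.
        scores = torch.cat([scores, queries @ cached.t()], dim=1)
    labels = torch.arange(queries.size(0), device=queries.device)
    loss = F.cross_entropy(scores / temp, labels)
    cache.update(docs)
    return loss
```

Because the cached embeddings are detached, they grow the effective negative pool without increasing the GPU memory needed for encoding long documents, which the abstract identifies as the bottleneck.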
