Paper Title
CMT in TREC-COVID Round 2: Mitigating the Generalization Gaps from Web to Special Domain Search
Paper Authors
Paper Abstract
Neural rankers based on deep pretrained language models (LMs) have been shown to improve many information retrieval benchmarks. However, these methods are affected by the correlation between the pretraining domain and the target domain, and they rely on massive fine-tuning relevance labels. Directly applying pretrained models to a specific domain, such as the COVID domain, may yield suboptimal search quality due to domain adaptation problems. This paper presents a search system that alleviates this special-domain adaptation problem. The system uses domain-adaptive pretraining and few-shot learning techniques to help neural rankers mitigate the domain discrepancy and label scarcity problems. In addition, we integrate dense retrieval to alleviate the vocabulary mismatch obstacle of traditional sparse retrieval. Our system performs the best among the non-manual runs in Round 2 of the TREC-COVID task, which aims to retrieve useful information from scientific literature related to COVID-19. Our code is publicly available at https://github.com/thunlp/OpenMatch.
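The vocabulary mismatch problem that dense retrieval addresses can be illustrated with a toy sketch (this is not the paper's system; all embeddings and vocabulary below are hypothetical): sparse retrieval scores by exact term overlap, so a query term like "covid-19" earns no credit against a document that says "coronavirus", while dense retrieval scores by the inner product of learned term embeddings, which can place such synonyms close together.

```python
# Toy contrast of sparse term matching vs. dense vector scoring.
# All embeddings and vocabulary here are hypothetical illustrations.

def sparse_score(query_terms, doc_terms):
    """Sparse retrieval credits only exact term overlap, so synonyms miss."""
    return len(set(query_terms) & set(doc_terms))

def dense_score(q_vec, d_vec):
    """Dense retrieval scores by the inner product of embedding vectors."""
    return sum(q * d for q, d in zip(q_vec, d_vec))

# Hypothetical embeddings where "covid-19" and "coronavirus" are close.
emb = {
    "covid-19":    [0.90, 0.10, 0.00],
    "coronavirus": [0.85, 0.15, 0.05],
    "symptoms":    [0.10, 0.90, 0.00],
}

# Exact-match overlap between synonyms is zero...
print(sparse_score(["covid-19"], ["coronavirus"]))           # 0
# ...but their dense similarity is high (0.9*0.85 + 0.1*0.15 = 0.78).
print(dense_score(emb["covid-19"], emb["coronavirus"]))      # 0.78
```

In a real system the embeddings come from a pretrained encoder rather than a hand-written table, but the scoring contrast is the same.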