Paper Title
Unsupervised Domain Adaptation for Sparse Retrieval by Filling Vocabulary and Word Frequency Gaps
Paper Authors
Paper Abstract
IR models using a pretrained language model significantly outperform lexical approaches like BM25. In particular, SPLADE, which encodes texts to sparse vectors, is an effective model for practical use because it shows robustness to out-of-domain datasets. However, SPLADE still struggles with exact matching of words that appear infrequently in the training data. In addition, domain shifts in vocabulary and word frequencies deteriorate the IR performance of SPLADE. Because supervision data are scarce in the target domain, addressing the domain shifts without supervision data is necessary. This paper proposes an unsupervised domain adaptation method that fills vocabulary and word-frequency gaps. First, we expand the vocabulary and execute continual pretraining with a masked language model on a corpus of the target domain. Then, we multiply SPLADE-encoded sparse vectors by inverse document frequency weights to consider the importance of documents with low-frequency words. We conducted experiments using our method on datasets with a large vocabulary gap from the source domain. We show that our method outperforms the present state-of-the-art domain adaptation method. In addition, our method achieves state-of-the-art results when combined with BM25.
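The IDF-reweighting step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact IDF formula the authors use is not given here, so this sketch assumes the standard BM25-style smoothed IDF, and the helper names (`idf_weights`, `reweight_sparse_vector`) and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def idf_weights(doc_term_lists, vocab):
    # Smoothed IDF per vocabulary term (BM25-style formula; an assumption,
    # since the paper's exact weighting is not shown in the abstract).
    n_docs = len(doc_term_lists)
    df = Counter()
    for terms in doc_term_lists:
        df.update(set(terms))  # document frequency: count each term once per doc
    return {t: math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0) for t in vocab}

def reweight_sparse_vector(sparse_vec, idf):
    # Multiply each term weight in a SPLADE-style sparse vector (term -> weight)
    # by that term's IDF, boosting low-frequency terms.
    return {term: w * idf.get(term, 0.0) for term, w in sparse_vec.items()}

# Toy target-domain corpus and a toy SPLADE-like sparse document vector.
corpus = [["gene", "protein", "cell"], ["protein", "binding"], ["cell", "therapy"]]
vocab = {"gene", "protein", "cell", "binding", "therapy"}
idf = idf_weights(corpus, vocab)

doc_vec = {"protein": 1.2, "binding": 0.8}
reweighted = reweight_sparse_vector(doc_vec, idf)
```

Because "binding" occurs in fewer documents than "protein", its IDF is larger, so the reweighting amplifies the rarer term relative to the common one, which is the intended effect for exact matching of low-frequency words.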