Paper Title


Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers

Authors

Odunayo Ogundepo, Xinyu Zhang, Jimmy Lin

Abstract


Tokenization is a crucial step in information retrieval, especially for lexical matching algorithms, where the quality of indexable tokens directly impacts the effectiveness of a retrieval system. Since different languages have unique properties, the design of the tokenization algorithm is usually language-specific and requires at least some linguistic knowledge. However, only a handful of the 7000+ languages on the planet benefit from specialized, custom-built tokenization algorithms, while the other languages are stuck with a "default" whitespace tokenizer, which cannot capture the intricacies of different languages. To address this challenge, we propose a different approach to tokenization for lexical matching retrieval algorithms (e.g., BM25): using the WordPiece tokenizer, which can be built automatically from unsupervised data. We test the approach on 11 typologically diverse languages in the MrTyDi collection: results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages. In many cases, our approach also improves retrieval effectiveness when combined with existing custom-built tokenizers.
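The key mechanism the abstract relies on, WordPiece's greedy longest-match subword segmentation, can be sketched in a few lines. This is a simplified illustration with a toy vocabulary, not the paper's actual mBERT/BM25 pipeline; the function name and vocabulary here are invented for the example:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation, as used by WordPiece.

    Repeatedly takes the longest prefix of the remaining characters that
    appears in `vocab`; continuation pieces carry the "##" prefix.
    Returns [unk] if no valid segmentation exists.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark a mid-word continuation
            if piece in vocab:
                match = piece
                break
            end -= 1  # shrink the candidate and try again
        if match is None:
            return [unk]  # unsegmentable word
        tokens.append(match)
        start = end
    return tokens


# Toy vocabulary (hypothetical; a real WordPiece vocab is learned from data)
vocab = {"token", "##ization", "retrie", "##val", "bm", "##25"}
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("retrieval", vocab))     # ['retrie', '##val']
```

Because the vocabulary is learned purely from raw text frequency statistics, the same procedure applies to any language without hand-written segmentation rules, which is what lets the paper treat the mBERT tokenizer as a drop-in replacement for whitespace splitting before BM25 indexing.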
