Paper Title
Fast WordPiece Tokenization
Paper Authors
Paper Abstract
Tokenization is a fundamental preprocessing step for almost all NLP tasks. In this paper, we propose efficient algorithms for the WordPiece tokenization used in BERT, from single-word tokenization to general text (e.g., sentence) tokenization. When tokenizing a single word, WordPiece uses a longest-match-first strategy, known as maximum matching. The best known algorithms so far are O(n^2) (where n is the input length) or O(nm) (where m is the maximum vocabulary token length). We propose a novel algorithm whose tokenization complexity is strictly O(n). Our method is inspired by the Aho-Corasick algorithm. We introduce additional linkages on top of the trie built from the vocabulary, allowing smart transitions when the trie matching cannot continue. For general text, we further propose an algorithm that combines pre-tokenization (splitting the text into words) and our linear-time WordPiece method into a single pass. Experimental results show that our method is 8.2x faster than HuggingFace Tokenizers and 5.1x faster than TensorFlow Text on average for general text tokenization.
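To make the "longest-match-first" (maximum matching) strategy mentioned in the abstract concrete, the sketch below shows the straightforward greedy baseline for tokenizing a single word: at each position, try the longest remaining substring and shrink it until a vocabulary piece matches. This is the O(nm)-style baseline the paper improves upon, not the paper's linear-time trie algorithm with Aho-Corasick-style failure links. The toy vocabulary, the [UNK] fallback, and the BERT-style "##" continuation prefix are illustrative assumptions.

def wordpiece_maxmatch(word, vocab, unk_token="[UNK]"):
    """Tokenize a single word by repeatedly taking the longest vocabulary match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first, shrinking until a vocabulary hit.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # non-initial pieces carry the "##" prefix in BERT
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matches: treat the whole word as unknown
        tokens.append(match)
        start = end
    return tokens

# Example usage with a toy vocabulary.
vocab = {"un", "##aff", "##able", "aff", "able"}
print(wordpiece_maxmatch("unaffable", vocab))  # ['un', '##aff', '##able']

The paper's contribution, as described in the abstract, is to replace this repeated re-scanning with a single pass over a vocabulary trie augmented with precomputed failure links (inspired by Aho-Corasick), so that tokenization runs in strictly O(n) time and can be fused with pre-tokenization for general text.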