Paper Title

Pre-training Tasks for User Intent Detection and Embedding Retrieval in E-commerce Search

Paper Authors

Yiming Qiu, Chenyu Zhao, Han Zhang, Jingwei Zhuo, Tianhao Li, Xiaowei Zhang, Songlin Wang, Sulong Xu, Bo Long, Wen-Yun Yang

Paper Abstract

BERT-style models pre-trained on a general corpus (e.g., Wikipedia) and fine-tuned on a specific task corpus have recently emerged as breakthrough techniques for many NLP tasks: question answering, text classification, sequence labeling and so on. However, this technique may not always work, especially in two scenarios: a corpus whose text differs greatly from the general corpus Wikipedia, or a task that learns an embedding spatial distribution for a specific purpose (e.g., approximate nearest neighbor search). In this paper, to tackle the above two scenarios that we have encountered in an industrial e-commerce search system, we propose customized and novel pre-training tasks for two critical modules: user intent detection and semantic embedding retrieval. The customized pre-trained models after fine-tuning, being less than 10% of BERT-base's size in order to be feasible for cost-efficient CPU serving, significantly improve over the other baseline models: 1) a model with no pre-training and 2) a model fine-tuned from the official BERT pre-trained on the general corpus, on both offline datasets and the online system. We have open-sourced our datasets for the sake of reproducibility and future work.
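
To make the semantic embedding retrieval setup concrete, below is a minimal sketch (not the authors' code) of how a compact BERT-style encoder, far smaller than BERT-base, can produce query/item embeddings for nearest-neighbor retrieval. The configuration sizes, example texts, and the brute-force similarity search standing in for a real ANN index are all illustrative assumptions; the paper only states the fine-tuned models are under 10% of BERT-base's size.

```python
# Sketch: compact BERT-style encoder + embedding retrieval.
# Hyper-parameters and texts are hypothetical; the encoder here is randomly
# initialized, whereas in the paper it would be pre-trained on e-commerce
# text with the proposed tasks and then fine-tuned.
import torch
from transformers import BertConfig, BertModel, BertTokenizerFast

# Hypothetical compact configuration (well under 10% of BERT-base parameters).
config = BertConfig(hidden_size=128, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=512)
encoder = BertModel(config)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pool the last hidden states into one unit-length embedding per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    emb = (hidden * mask).sum(1) / mask.sum(1)           # masked mean pooling
    return torch.nn.functional.normalize(emb, dim=-1)    # unit norm for cosine

# Brute-force cosine search as a stand-in for an ANN index (e.g., HNSW/Faiss).
item_emb = embed(["iphone 12 case", "running shoes", "wireless earbuds"])
query_emb = embed(["phone cover"])
scores = query_emb @ item_emb.T                          # (1, num_items)
print(scores.argsort(descending=True))                   # items ranked by similarity
```

In production, the item embeddings would be pre-computed and indexed offline, and only the query encoder would run at serving time, which is why keeping the model small enough for CPU inference matters.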
