Paper Title

Pre-training Tasks for User Intent Detection and Embedding Retrieval in E-commerce Search

Paper Authors

Yiming Qiu, Chenyu Zhao, Han Zhang, Jingwei Zhuo, Tianhao Li, Xiaowei Zhang, Songlin Wang, Sulong Xu, Bo Long, Wen-Yun Yang

Paper Abstract

BERT-style models pre-trained on a general corpus (e.g., Wikipedia) and fine-tuned on a specific task corpus have recently emerged as breakthrough techniques for many NLP tasks: question answering, text classification, sequence labeling and so on. However, this technique may not always work, especially in two scenarios: a corpus whose text differs greatly from the general corpus Wikipedia, or a task that learns an embedding spatial distribution for a specific purpose (e.g., approximate nearest neighbor search). In this paper, to tackle the above two scenarios that we have encountered in an industrial e-commerce search system, we propose customized and novel pre-training tasks for two critical modules: user intent detection and semantic embedding retrieval. The customized pre-trained models after fine-tuning, being less than 10% of BERT-base's size in order to be feasible for cost-efficient CPU serving, significantly improve over the other baseline models: 1) a model with no pre-training and 2) a model fine-tuned from the official BERT pre-trained on the general corpus, on both offline datasets and the online system. We have open-sourced our datasets for the sake of reproducibility and future work.
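
To make the semantic embedding retrieval setup concrete, below is a minimal sketch (not the authors' code) of how a compact BERT-style encoder, far smaller than BERT-base, can produce query/item embeddings for nearest-neighbor retrieval. The configuration sizes, example texts, and the brute-force similarity search standing in for a real ANN index are all illustrative assumptions; the paper only states the fine-tuned models are under 10% of BERT-base's size.

```python
# Sketch: compact BERT-style encoder + embedding retrieval.
# Hyper-parameters and texts are hypothetical; the encoder here is randomly
# initialized, whereas in the paper it would be pre-trained on e-commerce
# text with the proposed tasks and then fine-tuned.
import torch
from transformers import BertConfig, BertModel, BertTokenizerFast

# Hypothetical compact configuration (well under 10% of BERT-base parameters).
config = BertConfig(hidden_size=128, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=512)
encoder = BertModel(config)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def embed(texts):
    """Mean-pool the last hidden states into one unit-length embedding per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)         # (B, T, 1)
    emb = (hidden * mask).sum(1) / mask.sum(1)           # masked mean pooling
    return torch.nn.functional.normalize(emb, dim=-1)    # unit norm for cosine

# Brute-force cosine search as a stand-in for an ANN index (e.g., HNSW/Faiss).
item_emb = embed(["iphone 12 case", "running shoes", "wireless earbuds"])
query_emb = embed(["phone cover"])
scores = query_emb @ item_emb.T                          # (1, num_items)
print(scores.argsort(descending=True))                   # items ranked by similarity
```

In production, the item embeddings would be pre-computed and indexed offline, and only the query encoder would run at serving time, which is why keeping the model small enough for CPU inference matters.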
