Paper Title
A Large-Scale Chinese Short-Text Conversation Dataset
Paper Authors
Paper Abstract
The advancements of neural dialogue generation models show promising results on modeling short-text conversations. However, training such models usually requires a large-scale, high-quality dialogue corpus, which is hard to access. In this paper, we present LCCC, a large-scale cleaned Chinese conversation dataset that contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-trained dialogue models trained on LCCC-base and LCCC-large, respectively. The cleaned dataset and the pre-trained models will facilitate research on short-text conversation modeling. All models and datasets are available at https://github.com/thu-coai/CDial-GPT.
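The abstract describes the cleaning procedure only as a combination of hand-written rules and a classifier trained on the 110K annotated pairs. A minimal sketch of such a two-stage filter is given below; the concrete rules, the `violates_rules` helper, and the `predict_noise_prob` interface are illustrative assumptions, not the authors' actual implementation.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def violates_rules(post: str, response: str) -> bool:
    """Rule stage: flag dialogue pairs with obvious noise.
    The specific rules here are illustrative guesses."""
    for utterance in (post, response):
        if URL_RE.search(utterance):       # contains a URL
            return True
        if len(utterance) < 2:             # too short to be meaningful
            return True
        if len(set(utterance)) == 1:       # a single repeated character
            return True
    return False

def clean(pairs, noise_classifier, threshold=0.5):
    """Two-stage filter: rules first, then a learned classifier
    (e.g. one trained on the 110K manually annotated pairs).
    `noise_classifier.predict_noise_prob` is a hypothetical interface."""
    kept = []
    for post, response in pairs:
        if violates_rules(post, response):
            continue
        if noise_classifier.predict_noise_prob(post, response) >= threshold:
            continue
        kept.append((post, response))
    return kept
```

For actual use, the released pipeline, datasets, and pre-trained checkpoints in the CDial-GPT repository linked above should be preferred over any re-implementation of this sketch.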