Paper Title
A Large-Scale Chinese Short-Text Conversation Dataset
Paper Authors
Paper Abstract
The advancements of neural dialogue generation models show promising results on modeling short-text conversations. However, training such models usually requires a large-scale, high-quality dialogue corpus, which is hard to access. In this paper, we present LCCC, a large-scale cleaned Chinese conversation dataset that contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues). The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built on a set of rules and a classifier trained on 110K manually annotated dialogue pairs. We also release pre-trained dialogue models trained on LCCC-base and LCCC-large, respectively. The cleaned dataset and the pre-trained models will facilitate research on short-text conversation modeling. All models and datasets are available at https://github.com/thu-coai/CDial-GPT.
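The abstract describes the cleaning procedure only as a combination of hand-written rules and a classifier trained on the 110K annotated pairs. A minimal sketch of such a two-stage filter is given below; the concrete rules, the `violates_rules` helper, and the `predict_noise_prob` interface are illustrative assumptions, not the authors' actual implementation.

```python
import re

URL_RE = re.compile(r"https?://\S+")

def violates_rules(post: str, response: str) -> bool:
    """Rule stage: flag dialogue pairs with obvious noise.
    The specific rules here are illustrative guesses."""
    for utterance in (post, response):
        if URL_RE.search(utterance):       # contains a URL
            return True
        if len(utterance) < 2:             # too short to be meaningful
            return True
        if len(set(utterance)) == 1:       # a single repeated character
            return True
    return False

def clean(pairs, noise_classifier, threshold=0.5):
    """Two-stage filter: rules first, then a learned classifier
    (e.g. one trained on the 110K manually annotated pairs).
    `noise_classifier.predict_noise_prob` is a hypothetical interface."""
    kept = []
    for post, response in pairs:
        if violates_rules(post, response):
            continue
        if noise_classifier.predict_noise_prob(post, response) >= threshold:
            continue
        kept.append((post, response))
    return kept
```

For actual use, the released pipeline, datasets, and pre-trained checkpoints in the CDial-GPT repository linked above should be preferred over any re-implementation of this sketch.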