CLTS+: A New Chinese Long Text Summarization Dataset with Abstractive Summaries

Paper Title

CLTS+: A New Chinese Long Text Summarization Dataset with Abstractive Summaries

Authors

Xiaojun Liu, Shunan Zang, Chuang Zhang, Xiaojun Chen, Yangyang Ding

Abstract

The lack of creative ability in abstractive methods is a particular problem in automatic text summarization: the summaries that models generate are mostly extracted from the source articles. One of the main causes of this problem is the lack of abstractive datasets, especially for Chinese. To address this, we paraphrase the reference summaries in CLTS, the Chinese Long Text Summarization dataset, correct factual inconsistency errors, and propose CLTS+, the first Chinese long text summarization dataset with a high level of abstractiveness. CLTS+ contains more than 180K article-summary pairs and is available online. Additionally, we introduce an intrinsic metric based on co-occurrence words to evaluate the dataset we constructed. We analyze the extraction strategies used in CLTS+ summaries against other datasets to quantify the abstractiveness and difficulty of the new data, and we train several baselines on CLTS+ to verify its utility for improving the creative ability of models.
