Title
WikiTableT: A Large-Scale Data-to-Text Dataset for Generating Wikipedia Article Sections
Authors
Abstract
Datasets for data-to-text generation typically focus either on multi-domain, single-sentence generation or on single-domain, long-form generation. In this work, we cast generating Wikipedia sections as a data-to-text generation task and create a large-scale dataset, WikiTableT, that pairs Wikipedia sections with their corresponding tabular data and various metadata. WikiTableT contains millions of instances covering a broad range of topics, as well as a variety of flavors of generation tasks with different levels of flexibility. We benchmark several training and decoding strategies on WikiTableT. Our qualitative analysis shows that the best approaches can generate fluent and high-quality text, but they struggle with coherence and factuality, showing the potential for our dataset to inspire future work on long-form generation.