Paper title
AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization
Paper authors
Paper abstract
Summarization is the task of compressing source document(s) into coherent and succinct passages. It is a valuable tool for presenting users with a concise and accurate sketch of the top-ranked documents related to their queries. Query-based multi-document summarization (qMDS) addresses this pervasive need, but research is severely limited by a lack of training and evaluation datasets, as existing single-document and multi-document summarization datasets are inadequate in form and scale. We propose a scalable approach called AQuaMuSe to automatically mine qMDS examples from question answering datasets and large document corpora. Our approach is unique in that it can generate a dual dataset, for both extractive and abstractive summaries. We publicly release a specific instance of an AQuaMuSe dataset with 5,519 query-based summaries, each associated with an average of 6 input documents selected from an index of 355M documents from Common Crawl. Extensive evaluation of the dataset, along with baseline summarization model experiments, is provided.