AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization

Paper Title

AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization

Authors

Sayali Kulkarni, Sheide Chammas, Wan Zhu, Fei Sha, Eugene Ie

Abstract

Summarization is the task of compressing one or more source documents into coherent and succinct passages. It is a valuable tool for presenting users with a concise and accurate sketch of the top-ranked documents related to their queries. Query-based multi-document summarization (qMDS) addresses this pervasive need, but research is severely limited by the lack of training and evaluation datasets, since existing single-document and multi-document summarization datasets are inadequate in form and scale. We propose a scalable approach called AQuaMuSe to automatically mine qMDS examples from question answering datasets and large document corpora. Our approach is unique in the sense that it can generate a dual dataset, for both extractive and abstractive summaries. We publicly release a specific instance of an AQuaMuSe dataset with 5,519 query-based summaries, each associated with an average of 6 input documents selected from an index of 355M documents from Common Crawl. An extensive evaluation of the dataset is provided, along with baseline summarization model experiments.
