Paper Title
Open4Business (O4B): An Open Access Dataset for Summarizing Business Documents
Paper Authors
Paper Abstract
A major challenge in fine-tuning deep learning models for automatic summarization is the need for large domain-specific datasets. One barrier to curating such data from resources like online publications is navigating the license regulations that govern their reuse, especially for commercial purposes. As a result, despite the availability of several business journals, there are no large-scale datasets for summarizing business documents. In this work, we introduce Open4Business (O4B), a dataset of 17,458 open access business articles and their reference summaries. The dataset poses a new challenge for summarization in the business domain, requiring highly abstractive and more concise summaries than other existing datasets. Additionally, we evaluate existing models on it and show that models trained on O4B achieve summarization performance comparable to models trained on a 7x larger non-open access dataset. We release the dataset, along with code that can be leveraged to gather similar data for multiple domains.