Paper Title

Long-Document Cross-Lingual Summarization

Authors

Shaohui Zheng, Zhixu Li, Jiaan Wang, Jianfeng Qu, An Liu, Lei Zhao, Zhigang Chen

Abstract

Cross-Lingual Summarization (CLS) aims at generating summaries in one language for the given documents in another language. CLS has attracted wide research attention due to its practical significance in the multi-lingual world. Though great contributions have been made, existing CLS works typically focus on short documents, such as news articles, short dialogues and guides. Different from these short texts, long documents such as academic articles and business reports usually discuss complicated subjects and consist of thousands of words, making them non-trivial to process and summarize. To promote CLS research on long documents, we construct Perseus, the first long-document CLS dataset which collects about 94K Chinese scientific documents paired with English summaries. The average length of documents in Perseus is more than two thousand tokens. As a preliminary study on long-document CLS, we build and evaluate various CLS baselines, including pipeline and end-to-end methods. Experimental results on Perseus show the superiority of the end-to-end baseline, outperforming the strong pipeline models equipped with sophisticated machine translation systems. Furthermore, to provide a deeper understanding, we manually analyze the model outputs and discuss specific challenges faced by current approaches. We hope that our work could benchmark long-document CLS and benefit future studies.
