Paper Title
SPECTER: Document-level Representation Learning using Citation-informed Transformers
Paper Authors
Paper Abstract
Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, the embeddings power strong performance on end tasks. We propose SPECTER, a new method to generate document-level embedding of scientific documents based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning. Additionally, to encourage further research on document-level models, we introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation. We show that SPECTER outperforms a variety of competitive baselines on the benchmark.
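The abstract notes that SPECTER embeddings can be used downstream without task-specific fine-tuning. As a minimal illustrative sketch (not the authors' official code), the snippet below assumes the pretrained weights are available on the Hugging Face hub under the identifier allenai/specter; a paper is represented by its title and abstract joined with the tokenizer's separator token, and the final [CLS] hidden state is taken as the document embedding.

```python
# Minimal sketch: extracting SPECTER-style document embeddings, assuming the
# pretrained model is published on the Hugging Face hub as "allenai/specter".
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

# Toy input: each paper is described by its title and abstract.
papers = [
    {"title": "Paper A", "abstract": "A study of citation-informed embeddings."},
    {"title": "Paper B", "abstract": "A benchmark for document-level tasks."},
]

# Concatenate title and abstract with the tokenizer's separator token.
texts = [p["title"] + tokenizer.sep_token + p["abstract"] for p in papers]
inputs = tokenizer(texts, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] token's final hidden state serves as the document-level embedding,
# usable directly for classification, recommendation, or nearest-neighbor search
# without task-specific fine-tuning.
embeddings = outputs.last_hidden_state[:, 0, :]
print(embeddings.shape)  # (num_papers, hidden_size)
```

For context on the pretraining signal mentioned in the abstract: the citation graph supplies relatedness triplets (a query paper, a cited paper as positive, an uncited paper as negative), and the model is trained with a margin loss so that related documents end up closer in the embedding space.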