Paper Title
SPECTER: Document-level Representation Learning using Citation-informed Transformers
Paper Authors
Paper Abstract
Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, the embeddings power strong performance on end tasks. We propose SPECTER, a new method to generate document-level embedding of scientific documents based on pretraining a Transformer language model on a powerful signal of document-level relatedness: the citation graph. Unlike existing pretrained language models, SPECTER can be easily applied to downstream applications without task-specific fine-tuning. Additionally, to encourage further research on document-level models, we introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation. We show that SPECTER outperforms a variety of competitive baselines on the benchmark.
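The abstract notes that SPECTER embeddings can be used downstream without task-specific fine-tuning. As a minimal illustrative sketch (not the authors' official code), the snippet below assumes the pretrained weights are available on the Hugging Face hub under the identifier allenai/specter; a paper is represented by its title and abstract joined with the tokenizer's separator token, and the final [CLS] hidden state is taken as the document embedding.

```python
# Minimal sketch: extracting SPECTER-style document embeddings, assuming the
# pretrained model is published on the Hugging Face hub as "allenai/specter".
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/specter")
model = AutoModel.from_pretrained("allenai/specter")

# Toy input: each paper is described by its title and abstract.
papers = [
    {"title": "Paper A", "abstract": "A study of citation-informed embeddings."},
    {"title": "Paper B", "abstract": "A benchmark for document-level tasks."},
]

# Concatenate title and abstract with the tokenizer's separator token.
texts = [p["title"] + tokenizer.sep_token + p["abstract"] for p in papers]
inputs = tokenizer(texts, padding=True, truncation=True,
                   max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] token's final hidden state serves as the document-level embedding,
# usable directly for classification, recommendation, or nearest-neighbor search
# without task-specific fine-tuning.
embeddings = outputs.last_hidden_state[:, 0, :]
print(embeddings.shape)  # (num_papers, hidden_size)
```

For context on the pretraining signal mentioned in the abstract: the citation graph supplies relatedness triplets (a query paper, a cited paper as positive, an uncited paper as negative), and the model is trained with a margin loss so that related documents end up closer in the embedding space.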