Title
Generative Entity Typing with Curriculum Learning
Authors
Abstract
Entity typing aims to assign types to the entity mentions in given texts. The traditional classification-based entity typing paradigm has two unignorable drawbacks: 1) it fails to assign an entity to types beyond the predefined type set, and 2) it can hardly handle few-shot and zero-shot situations where many long-tail types have only a few or even no training instances. To overcome these drawbacks, we propose a novel generative entity typing (GET) paradigm: given a text with an entity mention, the multiple types for the role that the entity plays in the text are generated with a pre-trained language model (PLM). However, PLMs tend to generate coarse-grained types after fine-tuning on entity typing datasets. Besides, we only have heterogeneous training data consisting of a small portion of human-annotated data and a large portion of auto-generated but low-quality data. To tackle these problems, we employ curriculum learning (CL) to train our GET model on the heterogeneous data, where the curriculum can be self-adjusted via self-paced learning according to the model's comprehension of type granularity and data heterogeneity. Our extensive experiments on datasets of different languages and downstream tasks justify the superiority of our GET model over state-of-the-art entity typing models. The code has been released on https://github.com/siyuyuan/GET.
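The self-paced curriculum described above can be illustrated with a minimal sketch: the trainer admits only samples whose current loss falls below a pace threshold, and gradually raises that threshold so harder (e.g. noisier auto-generated) samples enter training later. This is a generic self-paced learning illustration on a toy single-parameter model, not the paper's actual GET training code; the names `self_paced_schedule`, `lam`, and `growth` are illustrative assumptions.

```python
# Self-paced curriculum learning sketch (toy example, not the paper's code).
# A one-parameter linear model y = w * x stands in for the GET model.

def self_paced_schedule(losses, lam):
    """Select indices of samples whose loss is below the pace threshold lam."""
    return [i for i, loss in enumerate(losses) if loss < lam]

# Toy training data: inputs x with targets y = 2 * x.
data = [(x, 2 * x) for x in range(10)]

w = 0.0                  # model parameter, ground truth is 2.0
lr = 0.01                # learning rate
lam, growth = 5.0, 1.5   # pace threshold and its per-epoch growth factor

for epoch in range(50):
    # Per-sample squared losses under the current model.
    losses = [(w * x - y) ** 2 for x, y in data]
    # Train only on the currently "easy" samples.
    for i in self_paced_schedule(losses, lam):
        x, y = data[i]
        w -= lr * 2 * (w * x - y) * x  # gradient step on squared error
    lam *= growth                      # admit harder samples over time
```

After enough epochs the threshold admits every sample and `w` converges toward 2.0; the curriculum effect is that low-loss (easy) samples dominate early training, mirroring how the paper's CL setup prioritizes reliable data before noisy auto-generated data.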