Paper Title
The KITMUS Test: Evaluating Knowledge Integration from Multiple Sources in Natural Language Understanding Systems
Paper Authors
Abstract
Many state-of-the-art natural language understanding (NLU) models are based on pretrained neural language models. These models often make inferences using information from multiple sources. An important class of such inferences are those that require both background knowledge, presumably contained in a model's pretrained parameters, and instance-specific information that is supplied at inference time. However, the integration and reasoning abilities of NLU models in the presence of multiple knowledge sources have been largely understudied. In this work, we propose a test suite of coreference resolution subtasks that require reasoning over multiple facts. These subtasks differ in terms of which knowledge sources contain the relevant facts. We also introduce subtasks where knowledge is present only at inference time using fictional knowledge. We evaluate state-of-the-art coreference resolution models on our dataset. Our results indicate that several models struggle to reason on-the-fly over knowledge observed both at pretrain time and at inference time. However, with task-specific training, a subset of models demonstrates the ability to integrate certain knowledge types from multiple sources. Still, even the best performing models seem to have difficulties with reliably integrating knowledge presented only at inference time.