Paper Title
On the Predictive Power of Neural Language Models for Human Real-Time Comprehension Behavior
Paper Authors
Paper Abstract
Human reading behavior is tuned to the statistics of natural language: the time it takes human subjects to read a word can be predicted from estimates of the word's probability in context. However, it remains an open question what computational architecture best characterizes the expectations deployed in real time by humans that determine the behavioral signatures of reading. Here we test over two dozen models, independently manipulating computational architecture and training dataset size, on how well their next-word expectations predict human reading time behavior on naturalistic text corpora. We find that across model architectures and training dataset sizes the relationship between word log-probability and reading time is (near-)linear. We next evaluate how features of these models determine their psychometric predictive power, or ability to predict human reading behavior. In general, the better a model's next-word expectations, the better its psychometric predictive power. However, we find nontrivial differences across model architectures. For any given perplexity, deep Transformer models and n-gram models generally show superior psychometric predictive power over LSTM or structurally supervised neural models, especially for eye movement data. Finally, we compare models' psychometric predictive power to the depth of their syntactic knowledge, as measured by a battery of syntactic generalization tests developed using methods from controlled psycholinguistic experiments. Once perplexity is controlled for, we find no significant relationship between syntactic knowledge and predictive power. These results suggest that different approaches may be required to best model human real-time language comprehension behavior in naturalistic reading versus behavior for controlled linguistic materials designed for targeted probing of syntactic knowledge.
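The core quantitative claim above — that per-word reading times are (near-)linearly predictable from a word's log-probability in context — can be illustrated with a minimal sketch. The model, data, and function names below are our own illustration, not the paper's code: we estimate surprisal (negative log2 probability) of each word under a toy add-one-smoothed unigram model, then fit an ordinary least squares line relating surprisal to fabricated reading times.

```python
import math
from collections import Counter

def unigram_surprisals(corpus_tokens, test_tokens):
    """Surprisal (bits) of each test token under an add-one-smoothed
    unigram model estimated from corpus_tokens. (Toy stand-in for the
    neural and n-gram language models compared in the paper.)"""
    counts = Counter(corpus_tokens)
    vocab = set(corpus_tokens) | set(test_tokens)
    total = len(corpus_tokens) + len(vocab)  # add-one smoothing mass
    return [-math.log2((counts[w] + 1) / total) for w in test_tokens]

def fit_linear(x, y):
    """Ordinary least squares slope and intercept for y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return slope, my - slope * mx

# Toy demo: rarer words receive higher surprisal; reading times here are
# fabricated to increase linearly with surprisal, mimicking the
# near-linear relationship the paper reports on naturalistic corpora.
train = "the cat sat on the mat the cat ran".split()
test = "the cat mat ran".split()
s = unigram_surprisals(train, test)
rt = [200 + 25 * si for si in s]  # synthetic reading times (ms)
slope, intercept = fit_linear(s, rt)
```

Because the synthetic reading times are exactly linear in surprisal, the fitted slope recovers 25 ms per bit; with real eye-tracking or self-paced reading data, the regression residuals would quantify a model's psychometric predictive power.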