Paper Title

The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues

Paper Authors

Anaïs Tack, Chris Piech

Paper Abstract

How can we test whether state-of-the-art generative models, such as Blender and GPT-3, are good AI teachers, capable of replying to a student in an educational dialogue? Designing an AI teacher test is challenging: although evaluation methods are much-needed, there is no off-the-shelf solution to measuring pedagogical ability. This paper reports on a first attempt at an AI teacher test. We built a solution around the insight that you can run conversational agents in parallel to human teachers in real-world dialogues, simulate how different agents would respond to a student, and compare these counterpart responses in terms of three abilities: speak like a teacher, understand a student, help a student. Our method builds on the reliability of comparative judgments in education and uses a probabilistic model and Bayesian sampling to infer estimates of pedagogical ability. We find that, even though conversational agents (Blender in particular) perform well on conversational uptake, they are quantifiably worse than real teachers on several pedagogical dimensions, especially with regard to helpfulness (Blender: Δ ability = -0.75; GPT-3: Δ ability = -0.93).
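The abstract describes inferring pedagogical ability from pairwise comparative judgments using a probabilistic model and Bayesian sampling. As a rough illustration of that idea (not the authors' actual implementation), the sketch below fits a Bradley-Terry-style comparative-judgment model with a random-walk Metropolis sampler; the agent labels, the `comparisons` data, and the sampler settings are all hypothetical placeholders.

```python
import numpy as np

# Illustrative sketch: Bradley-Terry-style comparative judgments with Bayesian
# sampling. Each agent (human teacher, Blender, GPT-3) has a latent ability
# theta; P(i is judged better than j) = sigmoid(theta_i - theta_j).

rng = np.random.default_rng(0)

AGENTS = ["teacher", "blender", "gpt3"]  # hypothetical agent labels
# Hypothetical pairwise judgments on one dimension (e.g., helpfulness),
# stored as (winner_index, loser_index) pairs.
comparisons = [(0, 1), (0, 1), (0, 2), (0, 2), (1, 2), (0, 1), (2, 1), (0, 2)]

def log_posterior(theta):
    """Log joint: standard-normal prior on abilities + Bradley-Terry likelihood."""
    lp = -0.5 * np.sum(theta ** 2)            # N(0, 1) prior on each ability
    for winner, loser in comparisons:
        diff = theta[winner] - theta[loser]
        lp -= np.logaddexp(0.0, -diff)        # log sigmoid(diff), numerically stable
    return lp

def metropolis(n_samples=20000, step=0.3):
    """Random-walk Metropolis over the ability vector."""
    theta = np.zeros(len(AGENTS))
    samples = np.empty((n_samples, len(AGENTS)))
    lp = log_posterior(theta)
    for t in range(n_samples):
        proposal = theta + rng.normal(scale=step, size=theta.shape)
        lp_prop = log_posterior(proposal)
        if np.log(rng.random()) < lp_prop - lp:
            theta, lp = proposal, lp_prop
        samples[t] = theta
    return samples[n_samples // 2:]           # drop the first half as burn-in

posterior = metropolis()
abilities = posterior.mean(axis=0)
for name, ability in zip(AGENTS, abilities):
    # Report each agent's ability relative to the human teacher (delta ability).
    print(f"{name:8s} ability={ability:+.2f}  delta={ability - abilities[0]:+.2f}")
```

With real judgment data, the posterior difference between an agent's ability and the human teacher's ability would correspond to the Δ ability values reported in the abstract (e.g., Blender: Δ ability = -0.75 on helpfulness).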
