Paper Title
Task Ambiguity in Humans and Language Models
Paper Authors
Paper Abstract
Language models have recently achieved strong performance across a wide range of NLP benchmarks. However, unlike benchmarks, real-world tasks are often poorly specified, and agents must deduce the user's intended behavior from a combination of context, instructions, and examples. We investigate how both humans and models behave in the face of such task ambiguity by proposing AmbiBench, a new benchmark of six ambiguously-specified classification tasks. We evaluate humans and models on AmbiBench by seeing how well they identify the intended task using 1) instructions with varying degrees of ambiguity, and 2) different numbers of labeled examples. We find that the combination of model scaling (to 175B parameters) and training with human feedback data enables models to approach or exceed the accuracy of human participants across tasks, but that either one alone is not sufficient. In addition, we show how to dramatically improve the accuracy of language models trained without large-scale human feedback training by finetuning on a small number of ambiguous in-context examples, providing a promising direction for teaching models to generalize well in the face of ambiguity.
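To make the evaluation setup described in the abstract concrete, the following minimal Python sketch shows how a prompt for an ambiguously-specified classification task might be assembled from an instruction of varying ambiguity plus a small number of labeled examples. The task feature, example sentences, instruction wordings, and function names are invented for illustration and are not the authors' released AmbiBench data or code.

```python
# Hypothetical sketch of an AmbiBench-style prompt: an instruction (ambiguous or
# unambiguous) followed by a few labeled examples and an unlabeled query.
import random

# Illustrative task: label a sentence "X" if it mentions an animal, "Y" otherwise.
# These sentences and labels are made up for the sketch.
EXAMPLES = [
    ("The dog slept on the porch.", "X"),
    ("The lawyer read the contract.", "Y"),
    ("A sparrow landed on the fence.", "X"),
    ("The chef seasoned the soup.", "Y"),
]

INSTRUCTIONS = {
    "ambiguous": "Label each sentence as X or Y.",
    "unambiguous": "Label each sentence as X if it mentions an animal, otherwise Y.",
}

def build_prompt(instruction_type: str, n_shots: int, query: str) -> str:
    """Compose an instruction, n_shots labeled examples, and an unlabeled query."""
    shots = random.sample(EXAMPLES, n_shots)
    lines = [INSTRUCTIONS[instruction_type]]
    for sentence, label in shots:
        lines.append(f"Sentence: {sentence}\nLabel: {label}")
    lines.append(f"Sentence: {query}\nLabel:")
    return "\n\n".join(lines)

if __name__ == "__main__":
    # Varying instruction_type and n_shots corresponds to the two axes the
    # abstract describes: degree of instruction ambiguity and number of examples.
    print(build_prompt("ambiguous", n_shots=2, query="The horse galloped across the field."))
```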