Paper Title
Evaluating Human-Language Model Interaction
Paper Authors
Paper Abstract
Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.