Paper Title

Towards Unified Dialogue System Evaluation: A Comprehensive Analysis of Current Evaluation Protocols

Paper Authors

Sarah E. Finch, Jinho D. Choi

Paper Abstract

As conversational AI-based dialogue management has increasingly become a trending topic, the need for a standardized and reliable evaluation procedure grows even more pressing. The current state of affairs suggests various evaluation protocols to assess chat-oriented dialogue management systems, rendering it difficult to conduct fair comparative studies across different approaches and gain an insightful understanding of their values. To foster this research, a more robust evaluation protocol must be set in place. This paper presents a comprehensive synthesis of both automated and human evaluation methods on dialogue systems, identifying their shortcomings while accumulating evidence towards the most effective evaluation dimensions. A total of 20 papers from the last two years are surveyed to analyze three types of evaluation protocols: automated, static, and interactive. Finally, the evaluation dimensions used in these papers are compared against our expert evaluation on the system-user dialogue data collected from the Alexa Prize 2020.
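The abstract contrasts automated metrics with human (static and interactive) judgments and compares evaluation dimensions against expert ratings. As a minimal, hypothetical sketch of how such a comparison is commonly carried out (the paper does not prescribe this exact procedure; the variable names and toy data below are illustrative only), one can correlate per-dialogue automated metric scores with averaged human ratings on a given dimension:

```python
# Hypothetical sketch: checking whether an automated metric tracks
# human judgment on one evaluation dimension. Not the paper's method.
from scipy.stats import spearmanr

# Toy data: per-dialogue scores from an automated metric (e.g., BLEU)
# and averaged human ratings on one dimension (e.g., coherence).
automated_scores = [0.12, 0.34, 0.08, 0.51, 0.27]
human_ratings = [2.0, 4.5, 1.5, 4.0, 3.5]

# Spearman's rank correlation is a standard way to measure agreement
# between an automated metric and human judgments.
rho, p_value = spearmanr(automated_scores, human_ratings)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

A low correlation on a dimension suggests the automated metric is a poor proxy for human judgment there, which is one kind of shortcoming a survey like this can surface.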
