Paper Title

Is MultiWOZ a Solved Task? An Interactive TOD Evaluation Framework with User Simulator

Authors

Qinyuan Cheng, Linyang Li, Guofeng Quan, Feng Gao, Xiaofeng Mou, Xipeng Qiu

Abstract

Task-Oriented Dialogue (TOD) systems are drawing more and more attention in recent studies. Current methods focus on constructing pre-trained models or fine-tuning strategies, while the evaluation of TOD is limited by a policy mismatch problem: during evaluation, the user utterances are taken from the annotated dataset, whereas these utterances should respond to the system's previous responses, which can have many alternatives besides the annotated texts. Therefore, in this work, we propose an interactive evaluation framework for TOD. We first build a goal-oriented user simulator based on pre-trained models and then use the user simulator to interact with the dialogue system to generate dialogues. In addition, we introduce a sentence-level score and a session-level score to measure sentence fluency and session coherence in the interactive evaluation. Experimental results show that RL-based TOD systems trained with our proposed user simulator can achieve nearly 98% inform and success rates in the interactive evaluation on the MultiWOZ dataset, and the proposed scores measure response quality beyond the inform and success rates. We hope that our work will encourage simulator-based interactive evaluation in the TOD task.
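The interactive evaluation described in the abstract can be pictured as a simple loop: the goal-oriented user simulator and the dialogue system take turns producing utterances until the user's goal is met or a turn limit is reached, after which success metrics are computed over the generated dialogue. The sketch below illustrates only this loop; all class names, the goal format, and the toy rule-based behavior are hypothetical stand-ins for the paper's pre-trained-model-based components.

```python
# Minimal sketch of a simulator-driven interactive evaluation loop.
# UserSimulator, DialogueSystem, and the (slot, value) goal format are
# illustrative assumptions, not the paper's actual implementation
# (which builds both components on pre-trained language models).

MAX_TURNS = 20  # cap to avoid non-terminating dialogues


class UserSimulator:
    """Toy goal-oriented simulator: conveys one pending goal slot per turn."""

    def __init__(self, goal):
        self.pending = list(goal)  # slots the user still wants to convey

    def speak(self, system_response):
        # A real simulator would condition on the full dialogue history;
        # here we ignore the system response and emit the next goal slot.
        if not self.pending:
            return "thank you, goodbye"
        slot, value = self.pending.pop(0)
        return f"i want {slot} to be {value}"

    def satisfied(self):
        return not self.pending


class DialogueSystem:
    """Toy TOD system: acknowledges whatever the user requested."""

    def respond(self, user_utterance):
        return f"ok, noted: {user_utterance}"


def interactive_eval(goal):
    """Generate one dialogue by letting simulator and system interact."""
    user, system = UserSimulator(goal), DialogueSystem()
    dialogue, sys_resp = [], ""
    for _ in range(MAX_TURNS):
        user_utt = user.speak(sys_resp)
        sys_resp = system.respond(user_utt)
        dialogue.append((user_utt, sys_resp))
        if user.satisfied():
            break
    # Success here simply means the goal was fully conveyed; the paper's
    # inform/success rates and fluency/coherence scores are richer metrics.
    return dialogue, user.satisfied()


dialogue, success = interactive_eval([("area", "centre"), ("food", "italian")])
```

Because the user's next utterance is generated in reaction to the system's actual previous response, this loop avoids the policy mismatch that arises when evaluation replays fixed annotated user turns.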
