Paper Title
Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems
Paper Authors
Paper Abstract
Many automatic evaluation metrics have been proposed to score the overall quality of a response in open-domain dialogue. Generally, overall quality comprises various aspects, such as relevancy, specificity, and empathy, and the importance of each aspect differs according to the task. For instance, specificity is mandatory in a food-ordering dialogue task, whereas fluency is preferred in a language-teaching dialogue system. However, existing metrics are not designed to cope with such flexibility. For example, the BLEU score fundamentally relies only on word overlap, whereas BERTScore relies on the semantic similarity between the reference and candidate responses. Thus, they are not guaranteed to capture a required aspect such as specificity. To design a metric that is flexible to a task, we first propose making these qualities manageable by grouping them into three categories: understandability, sensibleness, and likability, where likability is a combination of the qualities essential for the task at hand. We also propose a simple method to combine the metrics of each aspect into a single metric called USL-H, which stands for Understandability, Sensibleness, and Likability in Hierarchy. We demonstrate that the USL-H score achieves good correlation with human judgment and maintains its configurability across different aspects and metrics.
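The abstract describes combining per-aspect scores hierarchically but does not spell out the composition formula here. The following is a minimal Python sketch of one plausible reading, assuming a weighted hierarchical gating; the function name, default weights, and gating scheme are illustrative assumptions, not the paper's exact definition.

# Hypothetical sketch of a hierarchical, configurable composition in the
# spirit of USL-H. Weights and gating are assumptions for illustration only.

def usl_h_score(understandability: float,
                sensibleness: float,
                likability: float,
                weights: tuple = (1.0, 1.0, 1.0)) -> float:
    """Combine per-aspect scores (each in [0, 1]) into a single score.

    Higher-level aspects are gated by lower ones: a response earns
    sensibleness credit only if it is understandable, and likability
    credit only if it is also sensible.
    """
    w_u, w_s, w_l = weights
    u = understandability
    s = understandability * sensibleness                 # gated by U
    l = understandability * sensibleness * likability    # gated by U and S
    total = w_u * u + w_s * s + w_l * l
    return total / (w_u + w_s + w_l)                     # normalize to [0, 1]

# Example: a fluent, sensible, but fairly generic response.
print(usl_h_score(understandability=1.0, sensibleness=0.9, likability=0.4))

The configurability claimed in the abstract would then correspond to swapping the sub-metric used for each aspect (e.g., a specificity or fluency scorer inside likability) and adjusting the weights per task.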