Paper Title
Challenges in Explanation Quality Evaluation
Paper Authors
Paper Abstract
While much research has focused on producing explanations, it remains unclear how the quality of the produced explanations can be evaluated in a meaningful way. Today's predominant approach is to quantify explanations using proxy scores that compare explanations to (human-annotated) gold explanations. This approach assumes that explanations which achieve higher proxy scores will also provide a greater benefit to human users. In this paper, we present problems with this approach. Concretely, we (i) formulate desired characteristics of explanation quality, (ii) describe how current evaluation practices violate them, and (iii) support our argumentation with initial evidence from a crowdsourcing case study in which we investigate the explanation quality of state-of-the-art explainable question answering systems. We find that proxy scores correlate poorly with human quality ratings and, additionally, become less expressive the more often they are used (i.e., following Goodhart's law). Finally, we propose guidelines to enable a meaningful evaluation of explanations in order to drive the development of systems that provide tangible benefits to human users.