Paper Title
Debugging Tests for Model Explanations
Paper Authors
Paper Abstract
We investigate whether post-hoc model explanations are effective for diagnosing model errors--model debugging. In response to the challenge of explaining a model's prediction, a vast array of explanation methods have been proposed. Despite their increasing use, it is unclear whether they are effective. To start, we categorize \textit{bugs}, based on their source, into:~\textit{data, model, and test-time} contamination bugs. For several explanation methods, we assess their ability to: detect spurious correlation artifacts (data contamination), diagnose mislabeled training examples (data contamination), differentiate between a (partially) re-initialized model and a trained one (model contamination), and detect out-of-distribution inputs (test-time contamination). We find that the methods tested are able to diagnose a spurious background bug, but cannot conclusively identify mislabeled training examples. In addition, a class of methods that modify the back-propagation algorithm is invariant to the higher-layer parameters of a deep network, and is hence ineffective for diagnosing model contamination. We complement our analysis with a human-subject study and find that subjects fail to identify defective models using attributions, relying instead primarily on model predictions. Taken together, our results provide guidance for practitioners and researchers turning to explanations as tools for model debugging.
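To make the model-contamination test concrete, here is a minimal sketch, assuming PyTorch/torchvision and plain input-gradient saliency as the attribution method: it compares attributions from a trained network against a copy whose top layer has been re-initialized. The model, input, and correlation check are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Minimal sketch (not the paper's exact protocol) of a model-contamination
# check: compare gradient attributions from a trained network against a
# copy whose top layer has been re-initialized.
import copy

import torch
import torchvision.models as models

def gradient_saliency(model, x):
    """Absolute input-gradient attribution for the top predicted class."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    logits[0, logits.argmax(dim=1)].backward()
    return x.grad.detach().abs()

model = models.resnet18(weights="IMAGENET1K_V1").eval()
x = torch.randn(1, 3, 224, 224)  # stand-in for a real, preprocessed image

# Simulate model contamination: re-initialize the final fully connected layer.
buggy = copy.deepcopy(model)
torch.nn.init.normal_(buggy.fc.weight, std=0.01)

sal_trained = gradient_saliency(model, x)
sal_buggy = gradient_saliency(buggy, x)

# A debugging-capable attribution should change when the parameters do;
# a correlation near 1 indicates invariance to the re-initialized layer.
corr = torch.corrcoef(
    torch.stack([sal_trained.flatten(), sal_buggy.flatten()])
)[0, 1]
print(f"attribution correlation, trained vs. re-initialized fc: {corr:.3f}")
```

Per the abstract's finding, a modified back-propagation method would yield a correlation near 1 in a check of this kind, whereas a method sensitive to the higher-layer parameters would produce visibly different attributions.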