Evaluating Attribution Methods using White-Box LSTMs

Paper Title

Evaluating Attribution Methods using White-Box LSTMs

Author

Hao, Yiding

Abstract

Interpretability methods for neural networks are difficult to evaluate because we do not understand the black-box models typically used to test them. This paper proposes a framework in which interpretability methods are evaluated using manually constructed networks, which we call white-box networks, whose behavior is understood a priori. We evaluate five methods for producing attribution heatmaps by applying them to white-box LSTM classifiers for tasks based on formal languages. Although our white-box classifiers solve their tasks perfectly and transparently, we find that all five attribution methods fail to produce the expected model explanations.
