Paper Title


A Principled Evaluation Protocol for Comparative Investigation of the Effectiveness of DNN Classification Models on Similar-but-non-identical Datasets

Paper Authors

Esla Timothy Anzaku, Haohan Wang, Arnout Van Messem, Wesley De Neve

Paper Abstract


Deep Neural Network (DNN) models are increasingly evaluated using new replication test datasets, which have been carefully created to be similar to older and popular benchmark datasets. However, running counter to expectations, DNN classification models show significant, consistent, and largely unexplained degradation in accuracy on these replication test datasets. While the popular evaluation approach is to assess the accuracy of a model by making use of all the datapoints available in the respective test datasets, we argue that doing so hinders us from adequately capturing the behavior of DNN models and from having realistic expectations about their accuracy. Therefore, we propose a principled evaluation protocol that is suitable for performing comparative investigations of the accuracy of a DNN model on multiple test datasets, leveraging subsets of datapoints that can be selected using different criteria, including uncertainty-related information. By making use of this new evaluation protocol, we determined the accuracy of $564$ DNN models on both (1) the CIFAR-10 and ImageNet datasets and (2) their replication datasets. Our experimental results indicate that the observed accuracy degradation between established benchmark datasets and their replications is consistently lower (that is, models do perform better on the replication test datasets) than the accuracy degradation reported in published works, with these published works relying on conventional evaluation approaches that do not utilize uncertainty-related information.
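To make the idea of accuracy comparisons on uncertainty-selected subsets concrete, below is a minimal sketch (not the authors' released code). It assumes the subset-selection criterion is a simple maximum-softmax-probability threshold, which is only one possible instance of the "uncertainty-related information" mentioned in the abstract; the function names and the toy data are illustrative.

```python
# Illustrative sketch: comparing a classifier's accuracy on an original test set
# and a replication test set, restricted to subsets of data points selected by an
# uncertainty-related criterion (here, a hypothetical max-softmax-probability threshold).

import numpy as np


def subset_accuracy(probs: np.ndarray, labels: np.ndarray, tau: float) -> tuple[float, float]:
    """Accuracy on the subset whose top softmax probability >= tau, plus the coverage."""
    confidence = probs.max(axis=1)      # uncertainty proxy: maximum softmax probability
    keep = confidence >= tau            # subset-selection criterion
    if keep.sum() == 0:
        return float("nan"), 0.0
    preds = probs[keep].argmax(axis=1)
    acc = float((preds == labels[keep]).mean())
    return acc, float(keep.mean())


def accuracy_degradation(probs_orig, y_orig, probs_repl, y_repl, tau: float) -> float:
    """Difference in subset accuracy between the original and replication test sets."""
    acc_orig, _ = subset_accuracy(probs_orig, y_orig, tau)
    acc_repl, _ = subset_accuracy(probs_repl, y_repl, tau)
    return acc_orig - acc_repl


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for a model's softmax outputs and the labels on the two test sets.
    probs_orig = rng.dirichlet(np.ones(10), size=1000)
    y_orig = rng.integers(0, 10, size=1000)
    probs_repl = rng.dirichlet(np.ones(10), size=1000)
    y_repl = rng.integers(0, 10, size=1000)

    # tau = 0.0 recovers the conventional full-test-set accuracy comparison.
    for tau in (0.0, 0.5, 0.9):
        gap = accuracy_degradation(probs_orig, y_orig, probs_repl, y_repl, tau)
        print(f"threshold {tau:.1f}: accuracy gap (original - replication) = {gap:+.3f}")
```

In this sketch, sweeping the threshold shows how the measured accuracy gap can change once low-confidence data points are excluded, which is the kind of comparison the proposed protocol enables across many models and selection criteria.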
