Paper Title

Identifying Statistical Bias in Dataset Replication

Authors

Logan Engstrom, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Jacob Steinhardt, Aleksander Madry

Abstract

Dataset replication is a useful tool for assessing whether improvements in test accuracy on a specific benchmark correspond to improvements in models' ability to generalize reliably. In this work, we present unintuitive yet significant ways in which standard approaches to dataset replication introduce statistical bias, skewing the resulting observations. We study ImageNet-v2, a replication of the ImageNet dataset on which models exhibit a significant (11-14%) drop in accuracy, even after controlling for a standard human-in-the-loop measure of data quality. We show that after correcting for the identified statistical bias, only an estimated $3.6\% \pm 1.5\%$ of the original $11.7\% \pm 1.0\%$ accuracy drop remains unaccounted for. We conclude with concrete recommendations for recognizing and avoiding bias in dataset replication. Code for our study is publicly available at http://github.com/MadryLab/dataset-replication-analysis.
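The kind of statistical bias the abstract describes can be illustrated with a toy simulation (a hypothetical sketch, not the paper's actual experiment; all names and parameters here are illustrative): if items are kept only when a *noisy* measurement of their quality clears a threshold, the true quality of the selected set is systematically lower than the observed measurements suggest, because thresholding on noisy scores favors items whose noise happened to push them upward.

```python
import random

random.seed(0)

N = 100_000        # candidate images (illustrative)
ANNOTATORS = 10    # noisy "votes" per image, giving an observed selection frequency
THRESHOLD = 0.7    # keep images whose observed frequency clears this bar

# Each image has an unobserved true selection frequency; the observed
# frequency is a noisy binomial estimate of it from a handful of annotators.
true_freq = [random.uniform(0.0, 1.0) for _ in range(N)]
obs_freq = [sum(random.random() < s for _ in range(ANNOTATORS)) / ANNOTATORS
            for s in true_freq]

# Select images by thresholding the *observed* (noisy) frequency.
selected = [(s, o) for s, o in zip(true_freq, obs_freq) if o >= THRESHOLD]
mean_true = sum(s for s, _ in selected) / len(selected)
mean_obs = sum(o for _, o in selected) / len(selected)

print(f"mean observed frequency of selected images: {mean_obs:.3f}")
print(f"mean true frequency of selected images:     {mean_true:.3f}")
# The true mean falls below the observed mean: filtering on a noisy
# quality measurement overstates the quality of the selected set.
```

Under these assumptions the gap between observed and true means is exactly the regression-to-the-mean effect that naive matching on a noisy human-in-the-loop statistic can introduce during dataset replication.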
