Paper Title
Certifying Data-Bias Robustness in Linear Regression
Paper Authors
Paper Abstract
Datasets typically contain inaccuracies due to human error and societal biases, and these inaccuracies can affect the outcomes of models trained on such datasets. We present a technique for certifying whether linear regression models are pointwise-robust to label bias in the training dataset, i.e., whether bounded perturbations to the labels of a training dataset result in models that change the prediction of test points. We show how to solve this problem exactly for individual test points, and provide an approximate but more scalable method that does not require advance knowledge of the test point. We extensively evaluate both techniques and find that linear models -- both regression- and classification-based -- often display high levels of bias-robustness. However, we also unearth gaps in bias-robustness, such as high levels of non-robustness for certain bias assumptions on some datasets. Overall, our approach can serve as a guide for when to trust, or question, a model's output.
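The abstract's core observation, that a linear regression prediction depends linearly on the training labels, can be illustrated with a small sketch. This is not the paper's certification algorithm; it is a minimal Python example under an assumed bias model (at most `k` labels each perturbed by at most `eps`, both hypothetical parameters) showing how a worst-case bound on the change in a single test prediction could be computed for ordinary least squares.

```python
# Illustrative sketch, not the paper's exact method: for ordinary least squares,
# the prediction at a test point x is f(x) = x^T (X^T X)^{-1} X^T y = a^T y,
# which is linear in the training labels y. Under an assumed bias model where
# at most k labels each shift by at most eps, the worst-case change in f(x)
# is eps times the sum of the k largest |a_i|.
import numpy as np

def prediction_change_bound(X, x_test, eps, k):
    """Upper-bound |f_perturbed(x_test) - f(x_test)| for OLS under the
    assumed bias model: at most k labels perturbed, each by at most eps."""
    # a = X (X^T X)^{-1} x_test gives each training label's influence on f(x_test).
    a = X @ np.linalg.solve(X.T @ X, x_test)
    # Worst case: place eps-sized perturbations on the k most influential labels.
    top_k = np.sort(np.abs(a))[::-1][:k]
    return eps * top_k.sum()

# Usage: if the bound is small relative to the decision margin (or a tolerance),
# the prediction is certified robust under this bias model.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
x_test = rng.normal(size=5)
pred = x_test @ np.linalg.solve(X.T @ X, X.T @ y)
bound = prediction_change_bound(X, x_test, eps=0.5, k=10)
print(f"prediction = {pred:.3f}, worst-case change <= {bound:.3f}")
```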