Title
Sample-based Uncertainty Quantification with a Single Deterministic Neural Network
Authors
Abstract
Development of an accurate, flexible, and numerically efficient uncertainty quantification (UQ) method is one of the fundamental challenges in machine learning. Previously, a UQ method called DISCO Nets was proposed (Bouchacourt et al., 2016), which trains a neural network by minimizing the energy score. In this method, a random noise vector in $\mathbb{R}^{10\text{--}100}$ is concatenated with the original input vector in order to produce a diverse ensemble forecast despite using a single neural network. While this method has shown promising performance on a hand pose estimation task in computer vision, it remained unexplored whether the method works as well for regression on tabular data, and how it competes with more recent advanced UQ methods such as NGBoost. In this paper, we propose an improved neural architecture for DISCO Nets that admits faster and more stable training while using only a compact noise vector of dimension $\sim \mathcal{O}(1)$. We benchmark this approach on a variety of real-world tabular datasets and confirm that it is competitive with, or even superior to, standard UQ baselines. Moreover, we observe that it exhibits better point-forecast performance than a neural network of the same size trained with the conventional mean squared error. As a further advantage of the proposed method, we show that local feature-importance computation methods such as SHAP can easily be applied to any subregion of the predictive distribution. A new elementary proof of the validity of using the energy score to learn predictive distributions is also provided.
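To make the training scheme described in the abstract concrete, below is a minimal PyTorch sketch of a DISCO Nets-style regressor with a compact noise vector and an empirical energy-score loss $\mathrm{ES}(P, y) = \mathbb{E}\|X - y\| - \tfrac{1}{2}\mathbb{E}\|X - X'\|$. This is an illustration under assumptions, not the authors' reference implementation: the class name `DiscoNet`, the layer sizes, and the choice `noise_dim=2` are arbitrary placeholders made here for the sketch.

```python
import torch
import torch.nn as nn


class DiscoNet(nn.Module):
    """Single deterministic net; predictive diversity comes only from the
    noise vector concatenated to the input (noise dimension ~ O(1))."""

    def __init__(self, in_dim: int, out_dim: int = 1, noise_dim: int = 2, hidden: int = 64):
        super().__init__()
        self.noise_dim = noise_dim
        self.body = nn.Sequential(
            nn.Linear(in_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x: torch.Tensor, n_samples: int = 8) -> torch.Tensor:
        # Replicate each input n_samples times, append fresh Gaussian noise,
        # and run one forward pass: n_samples forecasts per input.
        x_rep = x.repeat_interleave(n_samples, dim=0)
        z = torch.randn(x_rep.shape[0], self.noise_dim, device=x.device)
        out = self.body(torch.cat([x_rep, z], dim=1))
        return out.view(x.shape[0], n_samples, -1)  # (batch, n_samples, out_dim)


def energy_score(samples: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Empirical energy score  E||X - y|| - (1/2) E||X - X'||  per batch element.

    samples: (batch, m, d) forecast samples with m >= 2; y: (batch, d) targets.
    The energy score is a strictly proper scoring rule, so minimizing it pushes
    the sample cloud toward the true conditional distribution of y given x.
    """
    m = samples.shape[1]
    term1 = (samples - y.unsqueeze(1)).norm(dim=-1).mean(dim=1)
    pairwise = (samples.unsqueeze(1) - samples.unsqueeze(2)).norm(dim=-1)
    # The diagonal is zero, so summing all pairs and dividing by m*(m-1)
    # gives the unbiased estimate over distinct pairs.
    term2 = pairwise.sum(dim=(1, 2)) / (2 * m * (m - 1))
    return (term1 - term2).mean()
```

In use, one would call `model(x_batch, n_samples=8)` to draw an ensemble forecast and minimize `energy_score(samples, y_batch)` with any standard optimizer; at test time, statistics of the sample cloud (mean, quantiles) serve as point and interval predictions.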