Paper title
Data-IQ: Characterizing subgroups with heterogeneous outcomes in tabular data
Paper authors
Paper abstract
High average model performance can hide the fact that a model systematically underperforms on subgroups of the data. We consider the tabular setting, which surfaces the unique issue of outcome heterogeneity. This is prevalent in areas such as healthcare, where patients with similar features can have different outcomes, making reliable predictions challenging. To tackle this, we propose Data-IQ, a framework to systematically stratify examples into subgroups with respect to their outcomes. We do this by analyzing the behavior of individual examples during training, based on their predictive confidence and, importantly, the aleatoric (data) uncertainty. Capturing the aleatoric uncertainty permits a principled characterization and subsequent stratification of data examples into three distinct subgroups (Easy, Ambiguous, Hard). We experimentally demonstrate the benefits of Data-IQ on four real-world medical datasets. We show that, compared to baselines, Data-IQ's characterization of examples is the most robust to variation across similarly performant (yet different) models. Since Data-IQ can be used with any ML model (including neural networks, gradient boosting, etc.), this property ensures consistency of data characterization while allowing flexible model selection. Taking this a step further, we demonstrate that the subgroups enable us to construct new approaches to both feature acquisition and dataset selection. Furthermore, we highlight how the subgroups can inform reliable model usage, noting the significant impact of the Ambiguous subgroup on model generalization.
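The stratification described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas or thresholds, so everything below is an assumption made for illustration: the function name `stratify_examples` is hypothetical, confidence is taken as the mean probability assigned to the true label across training checkpoints, aleatoric uncertainty is taken as the mean of p(1 - p) across checkpoints, and the threshold values are arbitrary. It should not be read as the authors' reference implementation.

```python
import numpy as np


def stratify_examples(probs_true, conf_upper=0.75, conf_lower=0.25,
                      aleatoric_percentile=50):
    """Assign each example to Easy, Ambiguous, or Hard (illustrative sketch).

    probs_true: array of shape (n_checkpoints, n_examples) holding the
    probability the model assigned to each example's true label at each
    training checkpoint. All thresholds are assumed values, not taken
    from the source.
    """
    probs_true = np.asarray(probs_true, dtype=float)

    # Assumption: confidence is the mean true-label probability over checkpoints.
    confidence = probs_true.mean(axis=0)

    # Assumption: aleatoric uncertainty is the mean of p * (1 - p) over checkpoints.
    aleatoric = (probs_true * (1.0 - probs_true)).mean(axis=0)

    # Examples at or below the chosen percentile count as low-uncertainty.
    low_aleatoric = aleatoric <= np.percentile(aleatoric, aleatoric_percentile)

    # Default to Ambiguous, then carve out the two low-uncertainty groups.
    groups = np.full(confidence.shape, "Ambiguous", dtype=object)
    groups[(confidence >= conf_upper) & low_aleatoric] = "Easy"
    groups[(confidence <= conf_lower) & low_aleatoric] = "Hard"
    return confidence, aleatoric, groups


# Toy input: 3 checkpoints (rows) for 4 examples (columns).
toy_probs = np.array([
    [0.90, 0.10, 0.50, 0.40],
    [0.95, 0.05, 0.55, 0.60],
    [0.99, 0.02, 0.45, 0.50],
])
confidence, aleatoric, groups = stratify_examples(toy_probs)
print(list(groups))
```

Because the sketch only needs per-checkpoint probabilities for the true label, it does not depend on the model family, which is in line with the abstract's statement that Data-IQ can be used with any ML model. The percentile-based cut on uncertainty is relative to the dataset at hand, so the group sizes will shift with the chosen percentile.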