Title
Model-free feature selection to facilitate automatic discovery of divergent subgroups in tabular data
Authors
Abstract
Data-centric AI emphasizes the need to clean and understand data in order to achieve trustworthy AI. Existing technologies, such as AutoML, make it easy to design and train models automatically, but comparable capabilities for extracting data-centric insights are still lacking. Manual stratification of tabular data by a single feature (e.g., gender) does not scale to higher feature dimensions, which could be addressed by automatic discovery of divergent subgroups. Nonetheless, such automatic discovery techniques often search across a potentially exponential number of feature combinations, a search that could be simplified by a preceding feature selection step. Existing feature selection techniques for tabular data often involve fitting a particular model in order to select important features. However, such model-based selection is prone to model bias and spurious correlations, in addition to requiring extra resources to design, fine-tune, and train a model. In this paper, we propose a model-free and sparsity-based automatic feature selection (SAFS) framework to facilitate the automatic discovery of divergent subgroups. Different from filter-based selection techniques, we exploit the sparsity of objective measures across feature values to rank and select features. We validated SAFS on two publicly available datasets (MIMIC-III and Allstate Claims) and compared it with six existing feature selection methods. SAFS reduces feature selection time by factors of 81x and 104x, averaged across the existing methods, on the MIMIC-III and Claims datasets, respectively. Features selected by SAFS also achieve competitive detection performance; e.g., 18.3% of the features selected by SAFS in the Claims dataset detected divergent samples similar to those detected using the whole feature set (Jaccard similarity of 0.95), with a 16x reduction in detection time.
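The two quantitative ideas in the abstract can be illustrated with a minimal, hypothetical sketch. The paper's actual objective measures and ranking procedure are not given in the abstract, so the sketch below makes two labeled assumptions: it uses the Gini coefficient as a stand-in sparsity measure over a per-value score vector, and it compares two sets of detected divergent samples with plain Jaccard similarity.

```python
import numpy as np

def gini_sparsity(scores):
    """Gini coefficient of a non-negative score vector; higher means
    sparser (mass concentrated in few values). This is an illustrative
    proxy -- the sparsity measure actually used by SAFS is not stated
    in the abstract."""
    v = np.sort(np.abs(np.asarray(scores, dtype=float)))  # ascending
    n, total = v.size, v.sum()
    if total == 0:
        return 0.0
    k = np.arange(1, n + 1)
    # Standard Gini-index form for sorted, L1-normalized coefficients.
    return float(1.0 - 2.0 * np.sum((v / total) * ((n - k + 0.5) / n)))

def rank_features_by_sparsity(score_vectors):
    """Rank feature names by the sparsity of their (hypothetical)
    per-value objective-measure scores, sparsest first."""
    return sorted(score_vectors, key=lambda f: gini_sparsity(score_vectors[f]),
                  reverse=True)

def jaccard_similarity(detected_a, detected_b):
    """Overlap between two sets of detected divergent sample IDs,
    as used in the abstract's 0.95 comparison."""
    a, b = set(detected_a), set(detected_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Toy usage: a feature whose objective measure is concentrated in one
# value ranks above a feature whose measure is spread uniformly.
scores = {"age_band": [0.0, 0.0, 0.0, 0.9],   # sparse -> informative
          "weekday":  [0.2, 0.2, 0.2, 0.2]}   # uniform -> uninformative
print(rank_features_by_sparsity(scores))       # age_band first
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))
```

The ranking step is the filter-style selection the abstract describes (no model is fitted); the Jaccard step is how one would verify that a reduced feature set still flags essentially the same divergent samples.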