Paper title
Unsupervised Features Ranking via Coalitional Game Theory for Categorical Data
Paper authors
Abstract
Not all real-world data are labeled, and when labels are unavailable, obtaining them is often costly. Moreover, because many algorithms suffer from the curse of dimensionality, reducing the features in the data to a smaller set is often of great utility. Unsupervised feature selection aims to reduce the number of features, often using feature importance scores to quantify the relevance of individual features to the task at hand. Without labels, these scores can be based only on the distribution of the variables and a quantification of their interactions. Previous literature, which mainly investigates anomaly detection and clustering, fails to address redundancy elimination. We propose evaluating correlations among features to compute feature importance scores that represent the contribution of individual features to explaining the dataset's structure. Based on coalitional game theory, our feature importance scores include a notion of redundancy awareness, making them a tool for redundancy-free feature selection. We show that the resulting feature selection outperforms competing methods in lowering the redundancy rate while maximizing the information contained in the data. We also introduce an approximate version of the algorithm to reduce the complexity of computing Shapley values.
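
To make the idea of Shapley-based feature scores concrete, here is a minimal sketch, not the authors' implementation. The abstract does not name the value function, so the code assumes total correlation (sum of marginal entropies minus joint entropy) as an illustrative stand-in for "correlations among features". The function names `entropy`, `total_correlation`, `shapley_values`, and `shapley_values_sampled` are hypothetical, and the sampled variant is a generic permutation-sampling estimator rather than the approximation proposed in the paper.

```python
import random
from collections import Counter
from itertools import combinations
from math import factorial, log2


def entropy(rows, cols):
    """Shannon entropy (bits) of the joint distribution of the given columns."""
    if not cols:
        return 0.0
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    n = len(rows)
    return -sum((k / n) * log2(k / n) for k in counts.values())


def total_correlation(rows, cols):
    """Assumed value function: sum of marginal entropies minus joint entropy."""
    return sum(entropy(rows, (c,)) for c in cols) - entropy(rows, tuple(cols))


def shapley_values(rows, n_features, value=total_correlation):
    """Exact Shapley value per feature; enumerates all 2**n_features coalitions."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Standard Shapley weight |S|! (n - |S| - 1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                gain = value(rows, subset + (i,)) - value(rows, subset)
                phi[i] += weight * gain
    return phi


def shapley_values_sampled(rows, n_features, n_permutations=200, seed=0,
                           value=total_correlation):
    """Generic Monte Carlo estimate: average marginal gains over random orderings."""
    rng = random.Random(seed)
    phi = [0.0] * n_features
    order = list(range(n_features))
    for _ in range(n_permutations):
        rng.shuffle(order)
        prefix = ()
        previous = value(rows, prefix)
        for i in order:
            prefix = prefix + (i,)
            current = value(rows, prefix)
            phi[i] += (current - previous) / n_permutations
            previous = current
    return phi


# Toy categorical data: features 0 and 1 are identical, feature 2 is independent.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
exact = shapley_values(rows, 3)
approx = shapley_values_sampled(rows, 3)
```

On this toy data, the two duplicated features split the total correlation between them, while the independent feature contributes nothing. Under this assumed value function, a high score signals that a feature shares structure with others, so a redundancy-aware selection step would still have to decide which of the correlated features to keep.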