Paper Title
High-Dimensional Feature Selection for Genomic Datasets
Paper Authors
Paper Abstract
A central problem in machine learning and pattern recognition is the process of recognizing the most important features. In this paper, we provide a new feature selection method (DRPT) that consists of first removing the irrelevant features and then detecting correlations between the remaining features. Let $D=[A\mid \mathbf{b}]$ be a dataset, where $\mathbf{b}$ is the class label and $A$ is a matrix whose columns are the features. We solve $A\mathbf{x} = \mathbf{b}$ using the least squares method and the pseudo-inverse of $A$. Each component of $\mathbf{x}$ can be viewed as a weight assigned to the corresponding column (feature). We define a threshold based on the local maxima of $\mathbf{x}$ and remove those features whose weights are smaller than the threshold. To detect the correlations in the reduced matrix, which we still call $A$, we consider a perturbation $\tilde A$ of $A$. We prove that correlations are encoded in $\Delta\mathbf{x} = \mid \mathbf{x} - \tilde{\mathbf{x}} \mid$, where $\tilde{\mathbf{x}}$ is the least squares solution of $\tilde A\tilde{\mathbf{x}}=\mathbf{b}$. We cluster features first based on $\Delta\mathbf{x}$ and then using the entropy of features. Finally, a feature is selected from each sub-cluster based on its weight and entropy. The effectiveness of DRPT has been verified by performing a series of comparisons with seven state-of-the-art feature selection methods over ten genetic datasets ranging from 9,117 to 267,604 features. The results show that, overall, the performance of DRPT is favorable in several aspects compared to each feature selection algorithm.
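As an informal sketch of the weighting and perturbation steps described in the abstract (not the authors' implementation of DRPT), the NumPy snippet below computes the least-squares weights $\mathbf{x}$ via the pseudo-inverse of $A$ and the difference $\Delta\mathbf{x}$ under a small random perturbation of $A$. The matrix sizes, the synthetic labels, and the perturbation scale are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

# Toy stand-in for a dataset D = [A | b]: A is an n-by-m matrix whose columns
# are features, b is the class-label vector. Sizes are illustrative only.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 500))
b = rng.integers(0, 2, size=100).astype(float)

# Least-squares solution of A x = b via the pseudo-inverse; each entry of x
# acts as a weight for the corresponding feature (column of A).
x = np.linalg.pinv(A) @ b

# Perturb A slightly and re-solve. Large entries of |x - x_tilde| flag
# features whose weights are unstable, i.e. candidates for correlated groups.
A_tilde = A + 1e-3 * rng.standard_normal(A.shape)
x_tilde = np.linalg.pinv(A_tilde) @ b
delta_x = np.abs(x - x_tilde)

# Rank features by the magnitude of delta_x (for inspection only; the paper's
# full method additionally thresholds on local maxima of x and clusters by
# delta_x and feature entropy before selecting one feature per sub-cluster).
print(np.argsort(delta_x)[::-1][:10])
```

The snippet only reproduces the linear-algebra core (weights and $\Delta\mathbf{x}$); the thresholding, clustering, and entropy-based selection steps are summarized in the comments but not implemented here.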