Confidence Intervals for the Number of Components in Factor Analysis and Principal Components Analysis via Subsampling

Title

Confidence Intervals for the Number of Components in Factor Analysis and Principal Components Analysis via Subsampling

Authors

Chetkar Jha, Ian Barnett

Abstract

Factor analysis (FA) and principal component analysis (PCA) are popular statistical methods for summarizing and explaining the variability in multivariate datasets. By default, FA and PCA assume the number of components or factors to be known a priori. However, in practice the users first estimate the number of factors or components and then perform FA and PCA analyses using the point estimate. Therefore, in practice the users ignore any uncertainty in the point estimate of the number of factors or components. For datasets where the uncertainty in the point estimate is not ignorable, it is prudent to perform FA and PCA analyses for the range of positive integer values in the confidence intervals for the number of factors or components. We address this problem by proposing a subsampling-based data-intensive approach for estimating confidence intervals for the number of components in FA and PCA. We study the coverage probability of the proposed confidence intervals and provide non-asymptotic theoretical guarantees concerning the accuracy of the confidence intervals. As a byproduct, we derive the first-order Edgeworth expansion for spiked eigenvalues of the sample covariance matrix when the data matrix is generated under a factor model. We also demonstrate the usefulness of our approach through numerical simulations and by applying our approach for estimating confidence intervals for the number of factors of the genotyping dataset of the Human Genome Diversity Project.
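The abstract does not spell out the procedure, so the following is only a rough sketch of the general idea of a subsampling interval for the number of components, not the authors' algorithm. It assumes an eigenvalue-ratio point estimator as a stand-in, draws row subsamples without replacement, and reports plain percentiles of the subsample estimates. The function names, the subsample size `m`, the cap `k_max`, and the simulated factor model are all hypothetical choices made for illustration.

```python
import numpy as np


def estimate_num_components(X, k_max=8):
    """Eigenvalue-ratio point estimate (an assumed stand-in estimator)."""
    Xc = X - X.mean(axis=0)
    # Eigenvalues of the sample covariance matrix, in descending order.
    eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (X.shape[0] - 1))[::-1]
    # The largest gap between consecutive eigenvalues marks the cut-off.
    ratios = eigvals[:k_max] / eigvals[1:k_max + 1]
    return int(np.argmax(ratios)) + 1


def subsampling_interval(X, m, n_subsamples=200, alpha=0.05, seed=0):
    """Percentile interval from estimates on row subsamples of size m.

    Assumption: a naive percentile interval; the paper's construction
    and its theoretical guarantees may differ.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    estimates = np.empty(n_subsamples, dtype=int)
    for b in range(n_subsamples):
        idx = rng.choice(n, size=m, replace=False)  # without replacement
        estimates[b] = estimate_num_components(X[idx])
    lo, hi = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return int(np.floor(lo)), int(np.ceil(hi)), estimates


# Hypothetical simulated data: a factor model with three strong factors.
rng = np.random.default_rng(42)
n, p, k_true = 500, 20, 3
Q, _ = np.linalg.qr(rng.standard_normal((p, k_true)))
loadings = Q * np.array([6.0, 5.0, 4.0])  # orthogonal loading columns
factors = rng.standard_normal((n, k_true))
X = factors @ loadings.T + rng.standard_normal((n, p))

lo, hi, estimates = subsampling_interval(X, m=200)
print(f"point estimate: {estimate_num_components(X)}, interval: [{lo}, {hi}]")
```

With factors this strong the interval is expected to be narrow. In weaker-signal settings the subsample estimates would spread out, which is the situation the abstract describes as warranting analyses over a range of values.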
