Paper title

Post-clustering difference testing: valid inference and practical considerations

Authors

Benjamin Hivert, Denis Agniel, Rodolphe Thiébaut, Boris P. Hejblum

Abstract

Clustering is an unsupervised analysis method that groups samples into homogeneous, well-separated subgroups of observations called clusters. To interpret the clusters, statistical hypothesis tests are often used to identify the variables that significantly separate the estimated clusters from each other. However, because the hypotheses are derived from the clustering results, they are data-driven. This double use of the data causes traditional hypothesis tests to fail to control the Type I error rate, particularly because of uncertainty in the clustering process and the artificial differences it can create. We propose three novel statistical hypothesis tests that account for the clustering process. Our tests efficiently control the Type I error rate by identifying only variables that contain a true signal separating groups of observations.
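
The double use of the data described in the abstract can be illustrated with a small simulation. The sketch below is not one of the three tests proposed by the authors; it only shows the problem they address. It assumes a single Gaussian variable with no true subgroups, a simple one-dimensional 2-means clustering, and a Welch-type test with a normal approximation for the p-value. All function names (`two_means_1d`, `welch_p_value`, `rejection_rate`) are hypothetical and chosen for illustration.

```python
import math
import random
import statistics


def two_means_1d(x, n_iter=20):
    """Lloyd's algorithm with k=2 on one-dimensional data; returns 0/1 labels."""
    c0, c1 = min(x), max(x)  # deterministic initial centers (an assumption)
    labels = [0] * len(x)
    for _ in range(n_iter):
        # Assign each point to the nearest center, then update the centers.
        labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in x]
        g0 = [v for v, lab in zip(x, labels) if lab == 0]
        g1 = [v for v, lab in zip(x, labels) if lab == 1]
        c0, c1 = statistics.fmean(g0), statistics.fmean(g1)
    return labels


def welch_p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    if len(a) < 2 or len(b) < 2:
        return 1.0  # too few points to estimate a variance; do not reject
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rate(n_sims=200, n=100, alpha=0.05, seed=0, cluster=True):
    """Share of simulated null datasets in which the test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Null data: one homogeneous Gaussian population, no real clusters.
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if cluster:
            labels = two_means_1d(x)            # groups derived from the data
        else:
            labels = [i % 2 for i in range(n)]  # groups fixed before seeing data
        a = [v for v, lab in zip(x, labels) if lab == 0]
        b = [v for v, lab in zip(x, labels) if lab == 1]
        if welch_p_value(a, b) < alpha:
            rejections += 1
    return rejections / n_sims


print("rejection rate, groups from clustering:", rejection_rate(cluster=True))
print("rejection rate, groups fixed in advance:", rejection_rate(cluster=False))
```

Under these assumptions, the test applied to groups obtained by clustering rejects far more often than the nominal 5% level even though the data contain no true subgroups, whereas the same test applied to groups fixed in advance stays close to the nominal level.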
