Significance-Based Categorical Data Clustering

Paper Title

Significance-Based Categorical Data Clustering

Authors

Lianyu Hu, Mudi Jiang, Yan Liu, Zengyou He

Abstract

Although numerous algorithms have been proposed to solve the categorical data clustering problem, how to assess the statistical significance of a set of categorical clusters remains unaddressed. To fill this gap, we employ the likelihood ratio test to derive a test statistic that can serve as a significance-based objective function in categorical data clustering. Consequently, a new clustering algorithm is proposed in which the significance-based objective function is optimized via a Monte Carlo search procedure. As a by-product, we can further calculate an empirical $p$-value to assess the statistical significance of a set of clusters and develop an improved gap statistic for estimating the cluster number. Extensive experimental studies suggest that our method achieves performance comparable to state-of-the-art categorical data clustering algorithms. Moreover, the effectiveness of such a significance-based formulation for statistical cluster validation and cluster number estimation is demonstrated through comprehensive empirical results.
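
The abstract does not give the exact form of the test statistic or of the search procedure, so the following is only a rough sketch of the general idea, not the authors' method. It assumes a standard likelihood-ratio (G) statistic for independence between cluster labels and each categorical attribute, summed over attributes, and it estimates an empirical p-value by shuffling each attribute column to simulate data with no cluster structure. The names lr_statistic and empirical_p_value, and the parameters n_perm and seed, are hypothetical.

```python
import math
import random
from collections import Counter


def lr_statistic(data, labels):
    """Sum over attributes of the G statistic 2 * sum(O * ln(O / E)).

    O is the observed count of (cluster, value) pairs and E is the count
    expected if cluster labels and attribute values were independent.
    This is an assumed stand-in for the paper's test statistic.
    """
    n = len(data)
    cluster_sizes = Counter(labels)
    total = 0.0
    for j in range(len(data[0])):
        value_counts = Counter(row[j] for row in data)
        joint = Counter((lab, row[j]) for row, lab in zip(data, labels))
        for (lab, val), observed in joint.items():
            expected = cluster_sizes[lab] * value_counts[val] / n
            total += 2.0 * observed * math.log(observed / expected)
    return total


def empirical_p_value(data, labels, n_perm=999, seed=0):
    """Monte Carlo p-value: share of null datasets scoring at least as high.

    Each null dataset shuffles every attribute column independently, which
    keeps the marginal value frequencies but destroys cluster structure.
    """
    rng = random.Random(seed)
    observed = lr_statistic(data, labels)
    n_cols = len(data[0])
    hits = 0
    for _ in range(n_perm):
        columns = [[row[j] for row in data] for j in range(n_cols)]
        for column in columns:
            rng.shuffle(column)
        null_data = list(zip(*columns))
        if lr_statistic(null_data, labels) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Toy example: two perfectly separated groups of ten objects each.
toy_data = [("a", "x")] * 10 + [("b", "y")] * 10
toy_labels = [0] * 10 + [1] * 10
toy_stat = lr_statistic(toy_data, toy_labels)
toy_p = empirical_p_value(toy_data, toy_labels)
```

Because the labels are held fixed while the data are shuffled, this sketch tends to understate the p-value for a partition that was itself chosen by optimizing the statistic. A more faithful check would re-run the clustering on every null dataset.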
