Model-based clustering of categorical data based on the Hamming distance

Paper title

Model-based clustering of categorical data based on the Hamming distance

Authors

Raffaele Argiento, Edoardo Filippi-Mazzola, Lucia Paci

Abstract

A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method exploits the Hamming distance to define a family of probability mass functions to model the data. The elements of this family are then considered as kernels of a finite mixture model with an unknown number of components. Conjugate Bayesian inference has been derived for the parameters of the Hamming distribution model. The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, facilitating the computation with respect to customary reversible jump algorithms. The proposed model encompasses a parsimonious latent class model as a special case when the number of components is fixed. Model performances are assessed via a simulation study and reference datasets, showing improvements in clustering recovery over existing approaches.
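
To make the idea of a Hamming-distance-based probability mass function concrete, here is a minimal sketch. It assumes one natural construction, a kernel proportional to `exp(-d_H(x, c) / sigma)` with a single scale `sigma` shared by all attributes; the paper's exact parameterization may differ. The names `hamming_distance`, `hamming_pmf`, `center`, `sigma`, and `n_categories` are illustrative and not taken from the paper.

```python
import itertools
import math


def hamming_distance(x, c):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(c):
        raise ValueError("x and c must have the same length")
    return sum(xi != ci for xi, ci in zip(x, c))


def hamming_pmf(x, center, sigma, n_categories):
    """Assumed kernel: exp(-d_H(x, center) / sigma) divided by its normalizer.

    Because the kernel factorizes over attributes, the normalizing constant
    is prod_j (1 + (m_j - 1) * exp(-1 / sigma)), where m_j is the number of
    categories of attribute j.
    """
    d = hamming_distance(x, center)
    log_norm = sum(
        math.log1p((m - 1) * math.exp(-1.0 / sigma)) for m in n_categories
    )
    return math.exp(-d / sigma - log_norm)


# Sanity check on a small example: three attributes with 3, 2 and 4 categories.
n_categories = [3, 2, 4]
center = (0, 1, 2)
sigma = 0.5
support = list(itertools.product(*(range(m) for m in n_categories)))
total = sum(hamming_pmf(x, center, sigma, n_categories) for x in support)
```

Under this assumed form, `total` should equal 1 up to floating-point error, the mass is highest at `center`, and it decays with each mismatching attribute. A smaller `sigma` concentrates more mass on the center, which is what lets such a kernel act as a cluster component in a mixture.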
