Paper title
Model-based clustering of categorical data based on the Hamming distance
Paper authors
Paper abstract
A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method uses the Hamming distance to define a family of probability mass functions for modeling the data. The elements of this family are then used as kernels of a finite mixture model with an unknown number of components. Conjugate Bayesian inference is derived for the parameters of the Hamming distribution model. The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, which simplifies computation relative to customary reversible jump algorithms. The proposed model includes a parsimonious latent class model as a special case when the number of components is fixed. Model performance is assessed via a simulation study and reference datasets, showing improved clustering recovery over existing approaches.
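To make the abstract's central idea concrete, the following is a minimal sketch of how a Hamming distance can induce a probability mass function over categorical vectors. The abstract does not state the functional form, so the exponential kernel `exp(-d / sigma)`, the single shared scale `sigma`, and the names `hamming_distance`, `hamming_pmf`, `center`, and `n_levels` are all assumptions made for illustration and may differ from the paper's definition.

```python
from itertools import product
from math import exp


def hamming_distance(x, c):
    """Count the positions at which two equal-length categorical vectors differ."""
    if len(x) != len(c):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(x, c))


def hamming_pmf(x, center, n_levels, sigma):
    """Probability of x under an ASSUMED kernel proportional to exp(-d_H(x, center) / sigma).

    n_levels[j] is the number of categories of attribute j (hypothetical argument name).
    sigma > 0 is an assumed scale: small values concentrate mass on `center`.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    mismatch_weight = exp(-1.0 / sigma)
    # The kernel factorizes over attributes, so the normalizing constant is a
    # product: each attribute has one matching level (weight 1) and
    # (m - 1) mismatching levels (weight exp(-1 / sigma) each).
    norm = 1.0
    for m in n_levels:
        norm *= 1.0 + (m - 1) * mismatch_weight
    return exp(-hamming_distance(x, center) / sigma) / norm


if __name__ == "__main__":
    n_levels = (3, 2, 4)   # three attributes with 3, 2 and 4 categories
    center = (0, 1, 2)     # modal configuration of the cluster
    support = product(*(range(m) for m in n_levels))
    total = sum(hamming_pmf(x, center, n_levels, sigma=1.0) for x in support)
    print(f"total probability over the support: {total:.6f}")
```

Under this assumed form, the center acts as the mode of a cluster and the scale controls how quickly probability decays as attributes deviate from it. For instance, `hamming_distance(("a", "b", "c"), ("a", "x", "c"))` is 1 because only the second attribute differs.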