Spying on the prior of the number of data clusters and the partition distribution in Bayesian cluster analysis

Paper title

Spying on the prior of the number of data clusters and the partition distribution in Bayesian cluster analysis

Authors

Jan Greve, Bettina Grün, Gertraud Malsiner-Walli, Sylvia Frühwirth-Schnatter

Abstract

Cluster analysis aims at partitioning data into groups or clusters. In applications, the number of clusters is often unknown. Bayesian mixture models employed in such applications usually specify a flexible prior that accounts for the uncertainty in the number of clusters. However, a major empirical challenge in using these models is characterising the induced prior on the partitions. This work introduces an approach to compute descriptive statistics of the prior on the partitions for three selected Bayesian mixture models developed in the areas of Bayesian finite mixtures and Bayesian nonparametrics. The proposed methodology involves computationally efficient enumeration of the prior on the number of clusters in-sample (termed "data clusters") and determination of the first two prior moments of symmetric additive statistics characterising the partitions. The accompanying reference implementation is made available in the R package 'fipp'. Finally, we illustrate the proposed methodology through comparisons and discuss the implications for prior elicitation in applications.
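To make the idea of an induced prior on the number of data clusters concrete, the following is a minimal Python sketch under one stated assumption: the model is a Dirichlet process mixture with concentration parameter `alpha`, chosen here only as a familiar Bayesian nonparametric example. It is not the 'fipp' implementation and does not use that package's API; the function name `dp_prior_num_clusters` is hypothetical. Under this assumption, observation i+1 opens a new cluster with probability alpha / (alpha + i), so the prior on the number of data clusters can be enumerated exactly by a simple recursion.

```python
def dp_prior_num_clusters(n, alpha):
    """Prior P(K+ = k), k = 0..n, on the number of data clusters among n
    observations, ASSUMING a Dirichlet process (Chinese restaurant process)
    prior with concentration alpha. Illustrative only, not the fipp code."""
    probs = [1.0] + [0.0] * n  # with zero observations there are zero clusters
    for i in range(n):
        # Observation i+1 starts a new cluster with probability alpha/(alpha+i).
        p_new = alpha / (alpha + i)
        nxt = [0.0] * (n + 1)
        for k in range(i + 1):
            nxt[k] += probs[k] * (1.0 - p_new)  # joins an existing cluster
            nxt[k + 1] += probs[k] * p_new      # opens a new cluster
        probs = nxt
    return probs


def mean_num_clusters(n, alpha):
    """Prior mean of K+ under the same assumption (sum of Bernoulli means)."""
    return sum(alpha / (alpha + i) for i in range(n))


if __name__ == "__main__":
    pmf = dp_prior_num_clusters(100, 1.0)
    mode = max(range(len(pmf)), key=pmf.__getitem__)
    print("prior mode of K+:", mode)
    print("prior mean of K+:", mean_num_clusters(100, 1.0))
```

Comparing the resulting probability vector for several values of `alpha` gives a quick view of how strongly the hyperparameter drives the number of data clusters a priori, which is the kind of comparison the abstract refers to for prior elicitation.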
