Bayesian information criteria for clustering normally distributed data

Paper title

Bayesian information criteria for clustering normally distributed data

Authors

Webster, Anthony J.

Abstract

Maximum likelihood estimates (MLEs) are asymptotically normally distributed, and this property is used in meta-analyses to test the heterogeneity of estimates, either for a single cluster or for several sub-groups. More recently, MLEs for associations between risk factors and diseases have been hierarchically clustered to search for diseases with shared underlying causes, but an objective statistical criterion is needed to determine the number and composition of clusters. To tackle this problem, conventional statistical tests are briefly reviewed, before considering the posterior distribution for a partition of data into clusters. The posterior distribution is calculated by marginalising out the unknown cluster centres, and is different to the likelihood associated with mixture models. The calculation is equivalent to that used to obtain the Bayesian Information Criterion (BIC), but is exact, without a Laplace approximation. The result includes a sum of squares term, and terms that depend on the number and composition of clusters, that penalise the number of free parameters in the model. The usual BIC is shown to be unsuitable for clustering applications unless the number of items in each individual cluster is sufficiently large.
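
The heterogeneity test and the sum of squares term mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard inverse-variance weighted statistic (Cochran's Q) from meta-analysis, applied within each cluster of a partition. It is not the paper's exact posterior, whose penalty terms are not given in the abstract. The function names and the `labels` argument are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def weighted_mean(estimates: Sequence[float], variances: Sequence[float]) -> float:
    """Inverse-variance weighted mean of normally distributed estimates."""
    weights = [1.0 / v for v in variances]
    return sum(w * x for w, x in zip(weights, estimates)) / sum(weights)


def cochran_q(estimates: Sequence[float], variances: Sequence[float]) -> float:
    """Weighted sum of squared deviations from the weighted mean (Cochran's Q)."""
    mu = weighted_mean(estimates, variances)
    return sum((x - mu) ** 2 / v for x, v in zip(estimates, variances))


def partition_sum_of_squares(
    estimates: Sequence[float],
    variances: Sequence[float],
    labels: Sequence[int],
) -> Tuple[float, int]:
    """Sum Cochran's Q over the clusters of a partition.

    Returns the total weighted sum of squares and the degrees of freedom
    (number of items minus number of clusters), an assumption based on
    fitting one centre per cluster.
    """
    clusters: Dict[int, Tuple[List[float], List[float]]] = {}
    for x, v, label in zip(estimates, variances, labels):
        xs, vs = clusters.setdefault(label, ([], []))
        xs.append(x)
        vs.append(v)
    total_q = sum(cochran_q(xs, vs) for xs, vs in clusters.values())
    dof = len(estimates) - len(clusters)
    return total_q, dof
```

If the estimates in every cluster share a common centre, the total is approximately chi-squared distributed with the returned degrees of freedom, so a large value suggests the partition is too coarse. Because splitting clusters can only reduce this sum of squares, it cannot choose the number of clusters on its own, which is why the abstract calls for additional terms that penalise the number of free parameters.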
