Paper title
Hierarchical clustering with discrete latent variable models and the integrated classification likelihood
Paper authors
Paper abstract
Finding a set of nested partitions of a dataset is useful for uncovering relevant structure at different scales, and is often handled with data-dependent methodology. In this paper, we introduce a general two-step methodology for model-based hierarchical clustering. Taking the integrated classification likelihood (ICL) criterion as the objective function, this work applies to every discrete latent variable model (DLVM) for which this quantity is tractable. The first step of the methodology maximizes the criterion with respect to the partition. To address the known problem of sub-optimal local maxima found by greedy hill-climbing heuristics, we introduce a new hybrid algorithm based on a genetic algorithm that efficiently explores the space of solutions. The resulting algorithm carefully combines and merges different solutions, and allows the joint inference of the number $K$ of clusters as well as the clusters themselves. Starting from this natural partition, the second step of the methodology uses a bottom-up greedy procedure to extract a hierarchy of clusters. In a Bayesian context, this is achieved by treating the parameter $\alpha$ of the Dirichlet prior on the cluster proportions as a regularization term controlling the granularity of the clustering. A new approximation of the criterion is derived as a log-linear function of $\alpha$, which gives the merge decision criterion a simple functional form. This second step allows the exploration of the clustering at coarser scales. The proposed approach is compared with existing strategies on simulated as well as real settings, and its results are shown to be particularly relevant. A reference implementation of this work is available in the R package greed accompanying the paper.
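The role of $\alpha$ in the second step can be illustrated with a minimal sketch. It is not the implementation in the greed package and does not use the paper's log-linear approximation. It assumes a toy DLVM (binary features with independent Beta(1, 1) priors per cluster, chosen only because its integrated classification likelihood has a closed form) and evaluates the exact criterion $\log p(X \mid Z) + \log p(Z \mid \alpha)$, where the partition term under a symmetric Dirichlet prior is $\log p(Z \mid \alpha) = \log\Gamma(K\alpha) - K\log\Gamma(\alpha) + \sum_k \log\Gamma(n_k + \alpha) - \log\Gamma(n + K\alpha)$. All function names and the data are hypothetical.

```python
from itertools import combinations
from math import lgamma


def log_partition_prior(sizes, alpha):
    """log p(Z | alpha) with a symmetric Dirichlet(alpha) prior on proportions."""
    k, n = len(sizes), sum(sizes)
    return (lgamma(k * alpha) - k * lgamma(alpha)
            + sum(lgamma(m + alpha) for m in sizes) - lgamma(n + k * alpha))


def log_marginal_cluster(rows):
    """log p(rows) for binary features, Beta(1, 1) prior per feature (toy assumption)."""
    n = len(rows)
    total = 0.0
    for column in zip(*rows):
        s = sum(column)  # number of ones in this feature
        total += lgamma(s + 1) + lgamma(n - s + 1) - lgamma(n + 2)
    return total


def icl(data, clusters, alpha):
    """Exact criterion for a partition given as a list of lists of row indices."""
    observed = sum(log_marginal_cluster([data[i] for i in c]) for c in clusters)
    return observed + log_partition_prior([len(c) for c in clusters], alpha)


def best_merge(data, clusters, alpha):
    """Score every pairwise merge and return (score, partition) for the best one."""
    best = None
    for a, b in combinations(range(len(clusters)), 2):
        merged = [c for i, c in enumerate(clusters) if i not in (a, b)]
        merged.append(clusters[a] + clusters[b])
        score = icl(data, merged, alpha)
        if best is None or score > best[0]:
            best = (score, merged)
    return best


def greedy_hierarchy(data, clusters, alpha):
    """Bottom-up merging while the criterion improves; returns the partitions visited."""
    path = [clusters]
    while len(clusters) > 1:
        score, merged = best_merge(data, clusters, alpha)
        if score <= icl(data, clusters, alpha):
            break  # no merge improves the criterion at this alpha
        clusters = merged
        path.append(clusters)
    return path


# Two planted groups of six identical binary rows, deliberately over-segmented.
data = [[1, 1, 1, 0, 0, 0]] * 6 + [[0, 0, 0, 1, 1, 1]] * 6
start = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]

path_fine = greedy_hierarchy(data, start, alpha=1.0)
path_coarse = greedy_hierarchy(data, start, alpha=1e-20)
print(len(path_fine[-1]), len(path_coarse[-1]))
```

On this toy data the run with $\alpha = 1$ stops once the two planted groups are recovered, while the run with a very small $\alpha$ keeps merging down to a single cluster. For small $\alpha$ the partition term behaves like $(K - 1)\log\alpha$ plus a term independent of $\alpha$, which is one way to see why lowering $\alpha$ favors coarser partitions; the approximation derived in the paper may differ in its details.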