Paper Title

The Impact of Isolation Kernel on Agglomerative Hierarchical Clustering Algorithms

Authors

Han, Xin; Zhu, Ye; Ting, Kai Ming; Li, Gang

Abstract

Agglomerative hierarchical clustering (AHC) is one of the popular clustering approaches. Existing AHC methods, which are based on a distance measure, share one key issue: they have difficulty identifying adjacent clusters with varied densities, regardless of the cluster extraction method applied to the resultant dendrogram. In this paper, we identify the root cause of this issue and show that using a data-dependent kernel (instead of distance or an existing kernel) provides an effective means to address it. We analyse the condition under which existing AHC methods fail to extract clusters effectively, and the reason why the data-dependent kernel is an effective remedy. This leads to a new approach to kernelise existing hierarchical clustering algorithms, such as traditional AHC algorithms, HDBSCAN, GDL and PHA. In each of these algorithms, our empirical evaluation shows that the recently introduced Isolation Kernel produces a higher-quality or purer dendrogram than distance, the Gaussian Kernel and the adaptive Gaussian Kernel.
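To make the idea of swapping a distance for a data-dependent kernel concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a nearest-neighbour (Voronoi) variant of the Isolation Kernel, in which the similarity of two points is the fraction of random partitionings that place them in the same cell, and it assumes that `1 - K` can be passed to SciPy's standard linkage as a dissimilarity. The function names `isolation_kernel` and `kernel_ahc`, and the defaults for `psi` and `t`, are hypothetical choices made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist, squareform


def isolation_kernel(X, psi=8, t=200, seed=0):
    """Estimate a Voronoi-based Isolation Kernel matrix for the rows of X.

    psi is the number of points sampled per partitioning and t is the
    number of partitionings (both are illustrative defaults).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = np.zeros((n, n))
    for _ in range(t):
        # Sample psi points without replacement; they define a Voronoi partition.
        centres = X[rng.choice(n, size=psi, replace=False)]
        # Each point belongs to the cell of its nearest sampled centre.
        cell = cdist(X, centres).argmin(axis=1)
        # Count the pairs of points that share a cell in this partitioning.
        K += cell[:, None] == cell[None, :]
    return K / t


def kernel_ahc(X, n_clusters, method="average", **kernel_args):
    """Run SciPy's agglomerative linkage on the dissimilarity 1 - K."""
    K = isolation_kernel(X, **kernel_args)
    D = 1.0 - K  # assumption: treat one minus similarity as a dissimilarity
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method=method)
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

Because cells are smaller where the sampled points are denser, two points a fixed distance apart tend to receive a lower similarity in a dense region than in a sparse one, which is the data-dependent behaviour the abstract refers to. A call such as `kernel_ahc(X, n_clusters=2, psi=8, t=200)` returns one cluster label per row of `X`.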
