论文标题

Grit-dbscan:非常大的数据库的空间聚类算法

GriT-DBSCAN: A Spatial Clustering Algorithm for Very Large Databases

论文作者

Huang, Xiaogang, Ma, Tiefeng, Liu, Conan, Liu, Shuangzhe

论文摘要

DBSCAN是具有许多实际应用的基本空间聚类算法。但是,在最坏的情况下,算法的瓶颈是$ O(n^2)$。为了解决这一限制,我们提出了一种基于网格的算法,用于欧几里得空间中的精确dbscan,称为grit-dbscan,该算法基于以下两种技术。首先,我们引入了一棵网格树来组织非空网格,以进行有效的非相邻网格查询。其次,通过利用点之间的空间关系,我们提出了一种技术,即在确定两组之间的最小距离是小于还是等于一定阈值时,可以迭代地修剪不必要的距离计算。从理论上讲,我们证明了Grit-DBSCan的复杂性与数据集大小是线性的。此外,我们通过结合启发式方法或将第二种技术与现有算法结合使用,从而获得了两个变体的粒度-DBSCAN。实验是在合成和现实世界数据集上进行的,以评估粒度-DBSCAN及其变体的效率。我们的分析结果表明,我们的算法的表现优于现有算法。

DBSCAN is a fundamental spatial clustering algorithm with numerous practical applications. However, a bottleneck of the algorithm is in the worst case, the run time complexity is $O(n^2)$. To address this limitation, we propose a new grid-based algorithm for exact DBSCAN in Euclidean space called GriT-DBSCAN, which is based on the following two techniques. First, we introduce a grid tree to organize the non-empty grids for the purpose of efficient non-empty neighboring grids queries. Second, by utilising the spatial relationships among points, we propose a technique that iteratively prunes unnecessary distance calculations when determining whether the minimum distance between two sets is less than or equal to a certain threshold. We theoretically prove that the complexity of GriT-DBSCAN is linear to the data set size. In addition, we obtain two variants of GriT-DBSCAN by incorporating heuristics, or by combining the second technique with an existing algorithm. Experiments are conducted on both synthetic and real-world data sets to evaluate the efficiency of GriT-DBSCAN and its variants. The results of our analyses show that our algorithms outperform existing algorithms.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源