Grit-dbscan：非常大的数据库的空间聚类算法

论文标题

Grit-dbscan：非常大的数据库的空间聚类算法

GriT-DBSCAN: A Spatial Clustering Algorithm for Very Large Databases

论文作者

Huang, Xiaogang, Ma, Tiefeng, Liu, Conan, Liu, Shuangzhe

论文摘要

DBSCAN是具有许多实际应用的基本空间聚类算法。但是，在最坏的情况下，算法的瓶颈是$ O（n^2）$。为了解决这一限制，我们提出了一种基于网格的算法，用于欧几里得空间中的精确dbscan，称为grit-dbscan，该算法基于以下两种技术。首先，我们引入了一棵网格树来组织非空网格，以进行有效的非相邻网格查询。其次，通过利用点之间的空间关系，我们提出了一种技术，即在确定两组之间的最小距离是小于还是等于一定阈值时，可以迭代地修剪不必要的距离计算。从理论上讲，我们证明了Grit-DBSCan的复杂性与数据集大小是线性的。此外，我们通过结合启发式方法或将第二种技术与现有算法结合使用，从而获得了两个变体的粒度-DBSCAN。实验是在合成和现实世界数据集上进行的，以评估粒度-DBSCAN及其变体的效率。我们的分析结果表明，我们的算法的表现优于现有算法。

DBSCAN is a fundamental spatial clustering algorithm with numerous practical applications. However, a bottleneck of the algorithm is in the worst case, the run time complexity is $O(n^2)$. To address this limitation, we propose a new grid-based algorithm for exact DBSCAN in Euclidean space called GriT-DBSCAN, which is based on the following two techniques. First, we introduce a grid tree to organize the non-empty grids for the purpose of efficient non-empty neighboring grids queries. Second, by utilising the spatial relationships among points, we propose a technique that iteratively prunes unnecessary distance calculations when determining whether the minimum distance between two sets is less than or equal to a certain threshold. We theoretically prove that the complexity of GriT-DBSCAN is linear to the data set size. In addition, we obtain two variants of GriT-DBSCAN by incorporating heuristics, or by combining the second technique with an existing algorithm. Experiments are conducted on both synthetic and real-world data sets to evaluate the efficiency of GriT-DBSCAN and its variants. The results of our analyses show that our algorithms outperform existing algorithms.

下载PDF全文

下载文献需遵守相关版权规定

论文标题