可扩展有效的基于电导的图形群集

论文标题

可扩展有效的基于电导的图形群集

Scalable and Effective Conductance-based Graph Clustering

论文作者

Lin, Longlong, Li, Rong-Hua, Jia, Tao

论文摘要

在众多图形分析应用中，基于电导的图簇被认为是基本操作员。尽管基于电导的图聚类取得了重大成功，但现有的算法要么很难获得令人满意的聚类质量，要么具有很高的时间和空间复杂性以获得可证明的聚类质量。为了克服这些局限性，我们设计了一个强大的\ textit {peeling}基于基于图形群集框架\ textit {pcon}。我们表明，许多现有解决方案可以简化为我们的框架。也就是说，他们首先为每个顶点定义一个分数函数，然后以最小的分数迭代删除顶点。最后，他们在剥离过程中以最小的电导率输出结果。基于我们的框架，我们提出了两种新颖的算法\ textit {pcon \ _core}和\ emph {pcon \ _de}，具有线性时间和空间复杂性，它们可以从大量图中有高效，有效地识别出数十亿多个边缘的大图。令人惊讶的是，我们证明\ emph {pcon \ _de}可以识别具有接近恒定近似比的簇，从而对众所周知的二次二次cheeger绑定产生了重要的理论改进。现实生活和合成数据集的经验结果表明，我们的算法可以以高群集精度达到5 $ \ sim $ 42倍的加速$ 42倍，而使用1.4 $ \ sim $ \ sim $ 7.8倍的内存比基线算法少倍。

Conductance-based graph clustering has been recognized as a fundamental operator in numerous graph analysis applications. Despite the significant success of conductance-based graph clustering, existing algorithms are either hard to obtain satisfactory clustering qualities, or have high time and space complexity to achieve provable clustering qualities. To overcome these limitations, we devise a powerful \textit{peeling}-based graph clustering framework \textit{PCon}. We show that many existing solutions can be reduced to our framework. Namely, they first define a score function for each vertex, then iteratively remove the vertex with the smallest score. Finally, they output the result with the smallest conductance during the peeling process. Based on our framework, we propose two novel algorithms \textit{PCon\_core} and \emph{PCon\_de} with linear time and space complexity, which can efficiently and effectively identify clusters from massive graphs with more than a few billion edges. Surprisingly, we prove that \emph{PCon\_de} can identify clusters with near-constant approximation ratio, resulting in an important theoretical improvement over the well-known quadratic Cheeger bound. Empirical results on real-life and synthetic datasets show that our algorithms can achieve 5$\sim$42 times speedup with a high clustering accuracy, while using 1.4$\sim$7.8 times less memory than the baseline algorithms.

下载PDF全文

下载文献需遵守相关版权规定

论文标题