Efficient Dynamic Clustering: Capturing Patterns from Historical Cluster Evolution

Paper title

Efficient Dynamic Clustering: Capturing Patterns from Historical Cluster Evolution

Authors

Binbin Gu, Saeed Kargar, Faisal Nawab

Abstract

Clustering groups unlabeled objects into clusters based on the similarity inherent among them. It is important for many tasks, such as anomaly detection, database sharding, and record linkage. Some clustering methods are batch algorithms that incur a high overhead because they cluster all the objects in the database from scratch, or they assume an incremental workload. In practice, database objects are continuously updated, added, and removed, which makes previous results stale. Running batch algorithms is infeasible in such scenarios because doing so continuously would incur a significant overhead. This is particularly the case for high-velocity scenarios such as those in Internet of Things applications. In this paper, we tackle the problem of clustering in high-velocity dynamic scenarios, where objects are continuously updated, inserted, and deleted. Specifically, we propose a generally dynamic approach to clustering that utilizes previous clustering results. Our system, DynamicC, uses a machine learning model that is augmented with an existing batch algorithm. The DynamicC model trains by observing the clustering decisions made by the batch algorithm. After training, the DynamicC model is used in cooperation with the batch algorithm to achieve both accurate and fast clustering decisions. Experimental results on four real-world datasets and one synthetic dataset show that our approach performs better than the state-of-the-art method while achieving clustering results of similar accuracy to the baseline batch algorithm.
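The abstract describes a loop in which a model learns from a batch algorithm's decisions and then handles new objects itself, deferring to the batch algorithm when needed. The sketch below is only a toy illustration of that loop and is not the DynamicC method: the 1-D gap-based `batch_cluster`, the two learned gap statistics in `observe`, and the fallback rule in `insert` are all assumptions made for this example, and the real system uses a machine learning model whose details the abstract does not give.

```python
def batch_cluster(points, eps):
    """Stand-in batch algorithm (assumed): sort 1-D points, cut where the gap exceeds eps."""
    clusters = []
    for p in sorted(points):
        if clusters and p - clusters[-1][-1] <= eps:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters


def observe(clusters):
    """Toy 'training' step: summarize the batch output as two gap statistics.

    Returns the largest gap seen inside a cluster and the smallest gap
    seen between adjacent clusters.
    """
    inside = [b - a for c in clusters for a, b in zip(c, c[1:])]
    between = [nxt[0] - cur[-1] for cur, nxt in zip(clusters, clusters[1:])]
    return max(inside, default=0), min(between, default=float("inf"))


def insert(clusters, p, learned, eps):
    """Place a new point with the learned statistics; re-run the batch algorithm if unsure."""
    max_inside, min_between = learned
    # Distance from p to the nearest member of each existing cluster.
    dists = [min(abs(p - q) for q in c) for c in clusters]
    near = [i for i, d in enumerate(dists) if d <= max_inside]
    if len(near) == 1:
        # Clearly belongs to exactly one cluster: decide without the batch algorithm.
        clusters[near[0]].append(p)
        clusters[near[0]].sort()
        return "model:join"
    if not near and dists and min(dists) >= min_between:
        # Clearly far from every cluster: open a new one.
        clusters.append([p])
        clusters.sort()
        return "model:new"
    # Ambiguous case: fall back to the batch algorithm over all points.
    all_points = [q for c in clusters for q in c] + [p]
    clusters[:] = batch_cluster(all_points, eps)
    return "batch"
```

In this toy setup, calling `observe` on the output of `batch_cluster` and then passing the result to `insert` lets most insertions skip the full re-clustering, which is the cost the abstract aims to avoid. Deletions and updates, which the paper also covers, are left out of the sketch.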
