论文标题
secleds:通过多个MEDOID和MEDOID投票在不断发展的数据流中序列聚类
SECLEDS: Sequence Clustering in Evolving Data Streams via Multiple Medoids and Medoid Voting
论文作者
论文摘要
流媒体环境中的序列聚类是具有挑战性的,因为它在计算上很昂贵,并且序列可能会随着时间的流逝而发展。 K-Medoids或围绕MEDOIDS(PAM)进行分区(PAM)通常用于聚类序列,因为它支持基于对齐的距离,而K-Centers为实际数据项有助于群集可解释性。但是,离线K-Medoids对概念漂移并不支持,同时对于聚类数据流的昂贵也很昂贵。因此,我们提出了Secrets,这是具有恒定内存足迹的K-Medoids算法的流媒体变体。 secleds具有两个独特的属性:i)每个群集使用多个MEDOID,产生稳定的高质量簇,ii)它使用直观的Medioid投票方案来处理概念漂移,以近似群集距离。与现有的自适应算法为新概念创建新簇的算法不同,secleds遵循一种根本不同的方法,在这种方法中,簇本身随着不断发展的流而发展。使用真实和合成数据集,我们从经验上证明,不管漂移,流尺寸,数据维度和簇的数量如何,secleds会产生高质量的集群。我们将三种流行的流和批处理聚类算法进行比较。最先进的Banditpam用作离线基准测试。 Secleds可以达到可比的F1分数与Banditpam,同时将所需距离计算的数量减少了83.7%。重要的是,当溪流含有漂移时,Secrets的表现优于所有基准。我们还聚集了真正的网络流量,并提供证据表明,secleds可以在使用(昂贵的)动态时间翘曲距离的同时支持高达1.08 Gbps的网络带宽。
Sequence clustering in a streaming environment is challenging because it is computationally expensive, and the sequences may evolve over time. K-medoids or Partitioning Around Medoids (PAM) is commonly used to cluster sequences since it supports alignment-based distances, and the k-centers being actual data items helps with cluster interpretability. However, offline k-medoids has no support for concept drift, while also being prohibitively expensive for clustering data streams. We therefore propose SECLEDS, a streaming variant of the k-medoids algorithm with constant memory footprint. SECLEDS has two unique properties: i) it uses multiple medoids per cluster, producing stable high-quality clusters, and ii) it handles concept drift using an intuitive Medoid Voting scheme for approximating cluster distances. Unlike existing adaptive algorithms that create new clusters for new concepts, SECLEDS follows a fundamentally different approach, where the clusters themselves evolve with an evolving stream. Using real and synthetic datasets, we empirically demonstrate that SECLEDS produces high-quality clusters regardless of drift, stream size, data dimensionality, and number of clusters. We compare against three popular stream and batch clustering algorithms. The state-of-the-art BanditPAM is used as an offline benchmark. SECLEDS achieves comparable F1 score to BanditPAM while reducing the number of required distance computations by 83.7%. Importantly, SECLEDS outperforms all baselines by 138.7% when the stream contains drift. We also cluster real network traffic, and provide evidence that SECLEDS can support network bandwidths of up to 1.08 Gbps while using the (expensive) dynamic time warping distance.