Paper title
Detection and Evaluation of Clusters within Sequential Data
Authors
Abstract
Motivated by theoretical advances in dimensionality reduction techniques, we use a recent model, called Block Markov Chains, to conduct a practical study of clustering in real-world sequential data. Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees and can be deployed in sparse data regimes. Despite these favorable theoretical properties, a thorough evaluation of these algorithms in realistic settings has been lacking. We address this issue and investigate the suitability of these clustering algorithms for exploratory data analysis of real-world sequential data. In particular, our sequential data is derived from human DNA, written text, animal movement data, and financial markets. To evaluate the determined clusters and the associated Block Markov Chain model, we further develop a set of evaluation tools. These tools include benchmarking, spectral noise analysis, and statistical model selection tools. An efficient implementation of the clustering algorithm and the new evaluation tools is made available together with this paper. Practical challenges associated with real-world data are encountered and discussed. It is ultimately found that the Block Markov Chain model assumption, together with the tools developed here, can indeed produce meaningful insights in exploratory data analyses despite the complexity and sparsity of real-world data.
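The abstract names the Block Markov Chain model without defining it. The sketch below is one illustrative reading, not the implementation released with the paper. It assumes the common definition in which the states are partitioned into K clusters, the next cluster depends only on the current cluster, and the next state is drawn uniformly from within that cluster. The clustering step is a simplified stand-in: a rank-K approximation of the transition count matrix followed by plain k-means, with no trimming or refinement step. All function names and parameter values are hypothetical.

```python
import numpy as np


def simulate_bmc(p, sizes, length, rng):
    """Sample a path from a Block Markov Chain (illustrative helper).

    p     : K x K matrix of cluster-to-cluster transition probabilities
    sizes : number of states in each of the K clusters
    Returns the state path and the true cluster label of every state.
    """
    p = np.asarray(p, dtype=float)
    sizes = np.asarray(sizes)
    k = len(sizes)
    labels = np.repeat(np.arange(k), sizes)
    offsets = np.concatenate(([0], np.cumsum(sizes)[:-1]))

    # First simulate the chain at the cluster level.
    clusters = np.empty(length, dtype=int)
    clusters[0] = rng.integers(k)
    for t in range(1, length):
        clusters[t] = rng.choice(k, p=p[clusters[t - 1]])

    # Assumption: inside the visited cluster the state is uniform.
    path = offsets[clusters] + rng.integers(0, sizes[clusters])
    return path, labels


def transition_counts(path, n):
    """Count observed transitions x -> y between the n states."""
    counts = np.zeros((n, n))
    np.add.at(counts, (path[:-1], path[1:]), 1)
    return counts


def spectral_clusters(counts, k, iters=50):
    """Cluster states with a rank-k SVD embedding and plain k-means."""
    u, s, vt = np.linalg.svd(counts)
    # Rows of this embedding have the same pairwise distances as the
    # rows and columns of the rank-k approximation of `counts`.
    emb = np.hstack([u[:, :k] * s[:k], vt[:k].T * s[:k]])

    # Deterministic farthest-point initialisation of the k centers.
    centers = [emb[0]]
    for _ in range(1, k):
        dist = np.min([((emb - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(emb[np.argmax(dist)])
    centers = np.array(centers)

    # Lloyd iterations: assign to the nearest center, then update centers.
    for _ in range(iters):
        dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = emb[assign == j].mean(axis=0)
    return assign


# Toy example with made-up parameters: 2 clusters of 20 states each.
rng = np.random.default_rng(0)
path, labels = simulate_bmc([[0.8, 0.2], [0.3, 0.7]], [20, 20], 20000, rng)
estimate = spectral_clusters(transition_counts(path, 40), 2)
agreement = np.mean(estimate == labels)
print("agreement up to relabeling:", max(agreement, 1 - agreement))
```

The toy path is long relative to the number of state pairs, so this is a dense example. In the sparse regimes the abstract refers to, a plain SVD of the raw counts is likely to be less reliable, which is one reason evaluation tools such as those described above are needed.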