How I learned to stop worrying and love the curse of dimensionality: an appraisal of cluster validation in high-dimensional spaces

Paper title

How I learned to stop worrying and love the curse of dimensionality: an appraisal of cluster validation in high-dimensional spaces

Author

Powell, Brian A.

Abstract

The failure of the Euclidean norm to reliably distinguish between nearby and distant points in high-dimensional space is well known. This phenomenon of distance concentration manifests in a variety of data distributions, with i.i.d. or correlated features, including centrally distributed and clustered data. Unsupervised learning based on Euclidean nearest neighbors, and more general proximity-oriented data mining tasks like clustering, might therefore be adversely affected by distance concentration in high-dimensional applications. While considerable work has been done developing clustering algorithms with reliable high-dimensional performance, the problem of cluster validation, that is, of determining the natural number of clusters in a dataset, has not been carefully examined in high-dimensional problems. In this work we investigate how the sensitivities of common Euclidean norm-based cluster validity indices scale with dimension for a variety of synthetic data schemes, including well-separated and noisy clusters, and find that the overwhelming majority of indices have improved or stable sensitivity in high dimensions. The curse of dimensionality is therefore dispelled for this class of fairly generic data schemes.
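
The distance concentration mentioned in the abstract can be illustrated with a small sketch. This is not the paper's experiment: the function name `relative_contrast`, the uniform distribution on the unit cube, the sample size, and the seed are all assumptions chosen for illustration. It measures how much the farthest point differs from the nearest one, relative to the nearest, as the dimension grows.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances from one
    random query point to n_points i.i.d. uniform points in [0, 1]^dim.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # The contrast should shrink as the dimension increases.
    for dim in (2, 10, 100, 1000):
        print(f"dim={dim:5d}  relative contrast={relative_contrast(dim):.3f}")
```

Under these assumptions the relative contrast is large in two dimensions and small in a thousand, which is the sense in which nearby and distant points become hard to tell apart. The sketch covers only unclustered i.i.d. data and says nothing about the cluster validity indices studied in the paper.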
