Paper Title
Explainable $k$-Means and $k$-Medians Clustering
Paper Authors
Paper Abstract
Clustering is a popular form of unsupervised learning for geometric data. Unfortunately, many clustering algorithms lead to cluster assignments that are hard to explain, partially because they depend on all the features of the data in a complicated way. To improve interpretability, we consider using a small decision tree to partition a data set into clusters, so that clusters can be characterized in a straightforward manner. We study this problem from a theoretical viewpoint, measuring cluster quality by the $k$-means and $k$-medians objectives: Must there exist a tree-induced clustering whose cost is comparable to that of the best unconstrained clustering, and if so, how can it be found? In terms of negative results, we show, first, that popular top-down decision tree algorithms may lead to clusterings with arbitrarily large cost, and second, that any tree-induced clustering must in general incur an $\Omega(\log k)$ approximation factor compared to the optimal clustering. On the positive side, we design an efficient algorithm that produces explainable clusters using a tree with $k$ leaves. For two means/medians, we show that a single threshold cut suffices to achieve a constant factor approximation, and we give nearly-matching lower bounds. For general $k \geq 2$, our algorithm is an $O(k)$ approximation to the optimal $k$-medians and an $O(k^2)$ approximation to the optimal $k$-means. Prior to our work, no algorithms were known with provable guarantees independent of dimension and input size.
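The abstract's central object, a clustering induced by a small threshold tree, is concrete enough to prototype. Below is a minimal sketch in Python (assuming NumPy and scikit-learn; this is a simplified greedy variant in the spirit of the abstract, not a faithful implementation of the authors' algorithm or its guarantees). It takes reference centers from an ordinary $k$-means run, then recursively applies axis-aligned cuts $x_i \leq \theta$, choosing each cut to minimize the number of "mistakes", i.e. points separated from their own reference center, until each of the $k$ leaves contains exactly one center.

```python
# Sketch of tree-induced ("explainable") clustering: build a binary
# tree of axis-aligned threshold cuts whose k leaves each contain one
# reference center. Cut selection (minimize points split away from
# their own center) is a simplified stand-in for the paper's method.
import numpy as np

def build_threshold_tree(X, centers, labels):
    """Recursively split (points, centers) with threshold cuts until
    each leaf holds a single center. Returns a nested dict tree.
    `labels[i]` indexes the current `centers` array."""
    if len(centers) == 1:
        return {"leaf": True, "center": centers[0]}
    best = None  # (mistakes, dim, theta); assumes centers are distinct
    for dim in range(X.shape[1]):
        # candidate thresholds: midpoints between sorted center coords
        coords = np.sort(np.unique(centers[:, dim]))
        for theta in (coords[:-1] + coords[1:]) / 2:
            left_c = centers[:, dim] <= theta
            if left_c.all() or not left_c.any():
                continue  # a cut must separate at least two centers
            # a "mistake": a point on the opposite side of its center
            point_side = X[:, dim] <= theta
            center_side = centers[labels, dim] <= theta
            mistakes = int((point_side != center_side).sum())
            if best is None or mistakes < best[0]:
                best = (mistakes, dim, theta)
    _, dim, theta = best
    go_left = X[:, dim] <= theta
    left_c = centers[:, dim] <= theta

    def side(mask_pts, mask_ctr):
        idx = np.flatnonzero(mask_ctr)
        remap = {c: i for i, c in enumerate(idx)}
        # points whose center fell on the other side are reassigned to
        # the nearest surviving center (this is where mistakes cost us)
        sub_labels = np.array([
            remap.get(l, int(np.argmin(
                np.linalg.norm(centers[idx] - x, axis=1))))
            for x, l in zip(X[mask_pts], labels[mask_pts])
        ], dtype=int)
        return build_threshold_tree(X[mask_pts], centers[idx], sub_labels)

    return {"leaf": False, "dim": dim, "theta": theta,
            "left": side(go_left, left_c),
            "right": side(~go_left, ~left_c)}

def assign(tree, x):
    """Follow threshold cuts down to a leaf; return its center."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["dim"]] <= tree["theta"] \
            else tree["right"]
    return tree["center"]

# Usage sketch: obtain reference centers with ordinary k-means first.
from sklearn.cluster import KMeans
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
tree = build_threshold_tree(X, km.cluster_centers_, km.labels_)
print(assign(tree, X[0]))
```

The payoff is exactly the interpretability the abstract describes: each root-to-leaf path is a short conjunction of threshold conditions ("$x_3 \leq 0.7$ and $x_1 > -1.2$"), so every cluster has a direct characterization. Whether this particular greedy mistake count attains the paper's $O(k)$ and $O(k^2)$ approximation bounds is not established by the abstract alone; the sketch only illustrates the tree-induced clustering model.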