用于分层集聚聚类的公平算法

论文标题

用于分层集聚聚类的公平算法

Fair Algorithms for Hierarchical Agglomerative Clustering

论文作者

Chhabra, Anshuman, Mohapatra, Prasant

论文摘要

层次结构聚类（HAC）算法在现代数据科学中广泛使用，并寻求将数据集划分为簇，同时在数据示例之间产生层次结构关系。 HAC算法用于许多应用中，例如生物学，自然语言处理和推荐系统。因此，必须确保这些算法是公平的 - 即使数据集包含针对某些受保护组的偏差，生成的群集输出也不应区分这些组的样本。但是，最近在聚类公平方面的工作主要集中在基于中心的聚类算法上，例如K-Median和K-Means聚类。在本文中，我们提出了执行HAC的公平算法，该算法执行公平性约束1）不论所使用的距离连锁标准如何，2）概括为HAC的任何自然度量，3）对多个保护组的工作，以及4）4）具有竞争性的跑步时间。通过对多个现实世界UCI数据集进行的广泛实验，我们表明我们提出的算法发现与香草HAC以及其他最先进的公平聚类方法相比，聚类更公平。

Hierarchical Agglomerative Clustering (HAC) algorithms are extensively utilized in modern data science, and seek to partition the dataset into clusters while generating a hierarchical relationship between the data samples. HAC algorithms are employed in many applications, such as biology, natural language processing, and recommender systems. Thus, it is imperative to ensure that these algorithms are fair -- even if the dataset contains biases against certain protected groups, the cluster outputs generated should not discriminate against samples from any of these groups. However, recent work in clustering fairness has mostly focused on center-based clustering algorithms, such as k-median and k-means clustering. In this paper, we propose fair algorithms for performing HAC that enforce fairness constraints 1) irrespective of the distance linkage criteria used, 2) generalize to any natural measures of clustering fairness for HAC, 3) work for multiple protected groups, and 4) have competitive running times to vanilla HAC. Through extensive experiments on multiple real-world UCI datasets, we show that our proposed algorithm finds fairer clusterings compared to vanilla HAC as well as other state-of-the-art fair clustering approaches.

下载PDF全文

下载文献需遵守相关版权规定

论文标题