Paper title
Random Partition Models for Microclustering Tasks
Authors
Abstract
Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as entity resolution, modeling of sparse networks, and DNA sequencing tasks. Such applications require models that yield clusters whose sizes grow sublinearly with the total number of data points -- the microclustering property. Motivated by these issues, we propose a general class of random partition models that satisfy the microclustering property with well-characterized theoretical properties. Our proposed models overcome major limitations in the existing literature on microclustering models, namely a lack of interpretability, identifiability, and full characterization of model asymptotic properties. Crucially, we drop the classical assumption of having an exchangeable sequence of data points, and instead assume an exchangeable sequence of clusters. In addition, our framework provides flexibility in terms of the prior distribution of cluster sizes, computational tractability, and applicability to a large number of microclustering tasks. We establish theoretical properties of the resulting class of priors, where we characterize the asymptotic behavior of the number of clusters and of the proportion of clusters of a given size. Our framework allows a simple and efficient Markov chain Monte Carlo algorithm to perform statistical inference. We illustrate our proposed methodology on the microclustering task of entity resolution, where we provide a simulation study and real experiments on survey panel data.
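The contrast drawn in the abstract can be illustrated numerically. The microclustering property is usually formalized as: the size of the largest cluster, divided by the number of data points n, goes to 0 in probability as n grows. The sketch below is not the model proposed in the paper. It compares a Chinese restaurant process (a standard exchangeable partition model, chosen here as an assumption to stand in for "traditional" models) with a simplified scheme that draws cluster sizes i.i.d. from a geometric distribution and truncates the last cluster so the sizes sum to n. The function names, the geometric size distribution, and the truncation step are all hypothetical choices made for illustration.

```python
import random


def crp_cluster_sizes(n, alpha=1.0, seed=0):
    """Cluster sizes of n points seated by a Chinese restaurant process."""
    rng = random.Random(seed)
    sizes = []
    for i in range(n):  # i points are already assigned
        # Open a new cluster with probability alpha / (i + alpha).
        if rng.random() < alpha / (i + alpha):
            sizes.append(1)
            continue
        # Otherwise join an existing cluster with probability
        # proportional to its current size.
        r = rng.random() * i
        acc = 0
        for k, s in enumerate(sizes):
            acc += s
            if r < acc:
                sizes[k] += 1
                break
        else:
            # Guard against floating-point rounding at the upper edge.
            sizes[-1] += 1
    return sizes


def iid_cluster_sizes(n, p=0.5, seed=0):
    """Illustrative assumption: i.i.d. geometric cluster sizes on {1, 2, ...}.

    Clusters are generated one at a time until they cover n points;
    the final cluster is truncated so that the sizes sum to n.
    """
    rng = random.Random(seed)
    sizes = []
    remaining = n
    while remaining > 0:
        s = 1
        while rng.random() > p:  # keep growing with probability 1 - p
            s += 1
        s = min(s, remaining)
        sizes.append(s)
        remaining -= s
    return sizes


def largest_fraction(sizes):
    """Largest cluster size divided by the total number of points."""
    return max(sizes) / sum(sizes)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        crp = largest_fraction(crp_cluster_sizes(n))
        iid = largest_fraction(iid_cluster_sizes(n))
        print(f"n={n:6d}  CRP: {crp:.4f}  i.i.d. sizes: {iid:.4f}")
```

Under these assumptions, the largest-cluster fraction for the i.i.d.-size scheme should shrink as n increases, while for the Chinese restaurant process it stays of constant order. This mirrors the abstract's distinction between clusters that grow linearly with n and clusters that grow sublinearly.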