论文标题

HyperAID:在双曲线空间中降级,用于树木拟合和分层聚类

HyperAid: Denoising in hyperbolic spaces for tree-fitting and hierarchical clustering

论文作者

Chien, Eli, Tabaghi, Puoya, Milenkovic, Olgica

论文摘要

由于自然语言处理,系统发育,癌症基因组学和无数涉及层次结构的问题领域的应用,因此,树木成型的距离问题在理论计算机科学和机器学习社区中都受到了极大的关注。尽管存在几种可证明的精确算法,这些算法是固有地遵守树 - 件限制的数据的树木拟合,但对于如何最佳地拟合树 - 量表的数据,其结构中适度(或实质上)与树的差异的数据却少得多。对于这种嘈杂的数据,大多数可用的算法的性能较差,并且通常在代表树中产生负边缘。此外,目前尚不知道如何选择最合适的近似目标来进行嘈杂。我们的贡献如下。首先,我们在双曲空间中提出了一种新的树木 - 金属denoising(HyperAID)方法,该方法将原始数据转换为类似于``More''树的数据,当根据Gromov的$δ$ Hyprololicity评估时。其次,我们进行了一项消融研究,涉及两个选择近似目标的选择,$ \ ell_p $ norms和dasgupta损失。第三,我们将HyperAID与实施非负边缘的方案集成在一起。结果,HyperAID平台的表现优于文献中的所有其他现有方法,包括邻居加入(NJ),Treerep和T-Rex,包括合成和现实世界数据。综合数据由边缘增强的树和最短距离指标表示,而实际数据集包括动物园,虹膜,玻璃,分割和spambase;在这些数据集上,关于NJ的平均改进为$ 125.94 \%$。

The problem of fitting distances by tree-metrics has received significant attention in the theoretical computer science and machine learning communities alike, due to many applications in natural language processing, phylogeny, cancer genomics and a myriad of problem areas that involve hierarchical clustering. Despite the existence of several provably exact algorithms for tree-metric fitting of data that inherently obeys tree-metric constraints, much less is known about how to best fit tree-metrics for data whose structure moderately (or substantially) differs from a tree. For such noisy data, most available algorithms perform poorly and often produce negative edge weights in representative trees. Furthermore, it is currently not known how to choose the most suitable approximation objective for noisy fitting. Our contributions are as follows. First, we propose a new approach to tree-metric denoising (HyperAid) in hyperbolic spaces which transforms the original data into data that is ``more'' tree-like, when evaluated in terms of Gromov's $δ$ hyperbolicity. Second, we perform an ablation study involving two choices for the approximation objective, $\ell_p$ norms and the Dasgupta loss. Third, we integrate HyperAid with schemes for enforcing nonnegative edge-weights. As a result, the HyperAid platform outperforms all other existing methods in the literature, including Neighbor Joining (NJ), TreeRep and T-REX, both on synthetic and real-world data. Synthetic data is represented by edge-augmented trees and shortest-distance metrics while the real-world datasets include Zoo, Iris, Glass, Segmentation and SpamBase; on these datasets, the average improvement with respect to NJ is $125.94\%$.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源