Paper Title
Learning Multivariate CDFs and Copulas using Tensor Factorization
Paper Authors
Paper Abstract
Learning the multivariate distribution of data is a core challenge in statistics and machine learning. Traditional methods aim for the probability density function (PDF) and are limited by the curse of dimensionality. Modern neural methods are mostly based on black-box models, lacking identifiability guarantees. In this work, we aim to learn multivariate cumulative distribution functions (CDFs), as they can handle mixed random variables, allow efficient box probability evaluation, and have the potential to overcome local sample scarcity owing to their cumulative nature. We show that any grid-sampled version of a joint CDF of mixed random variables admits a universal representation as a naive Bayes model via the canonical polyadic (tensor-rank) decomposition. By introducing a low-rank model, either directly in the raw data domain or indirectly in a transformed (copula) domain, the resulting model affords efficient sampling, closed-form inference, and uncertainty quantification, and comes with uniqueness guarantees under relatively mild conditions. We demonstrate the superior performance of the proposed model on several synthetic and real datasets and in applications including regression, sampling, and data imputation. Interestingly, our experiments with real data show that it is possible to obtain better density/mass estimates indirectly via a low-rank CDF model than via a low-rank PDF/PMF model.
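To make the core idea concrete, the sketch below illustrates (under stated assumptions, not as the authors' implementation) what a grid-sampled joint CDF tensor looks like and how a low-rank canonical polyadic (CP) model of it can be fitted. It uses synthetic 3-D Gaussian data and the `tensorly` library; the grid sizes, rank, and variable names are illustrative choices, not values from the paper.

```python
# Minimal illustrative sketch (not the authors' code): grid-sample an empirical
# joint CDF and fit a low-rank, nonnegative CP decomposition to it.
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

rng = np.random.default_rng(0)
n, G, R = 5000, 16, 5                      # samples, grid points per dimension, CP rank (all illustrative)
X = rng.multivariate_normal(np.zeros(3),
                            [[1.0, 0.6, 0.3],
                             [0.6, 1.0, 0.5],
                             [0.3, 0.5, 1.0]], size=n)

# Grid-sampled empirical CDF: F[i, j, k] = P(X1 <= g1[i], X2 <= g2[j], X3 <= g3[k])
grids = [np.quantile(X[:, d], np.linspace(0.05, 1.0, G)) for d in range(3)]
ind = [(X[:, d][:, None] <= grids[d][None, :]).astype(float) for d in range(3)]
F = np.einsum('ni,nj,nk->ijk', *ind) / n

# Low-rank CP model of the CDF tensor; with suitable normalization the factor
# columns play the role of per-variable conditional CDFs and the weights the
# role of latent-state probabilities in a naive Bayes (latent-variable) view.
cp = non_negative_parafac(tl.tensor(F), rank=R, n_iter_max=500, tol=1e-8)
F_hat = tl.cp_to_tensor(cp)
print("relative fit error:", np.linalg.norm(F - F_hat) / np.linalg.norm(F))
```

Once such a low-rank CDF model is in hand, box probabilities over the grid follow from inclusion-exclusion on the reconstructed tensor; this is only meant to convey the shape of the computation, not the paper's full estimation or copula-domain procedure.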