Paper title
PClean: Bayesian Data Cleaning at Scale with Domain-Specific Probabilistic Programming
Paper authors
Paper abstract
Data cleaning is naturally framed as probabilistic inference in a generative model of ground-truth data and likely errors, but the diversity of real-world error patterns and the hardness of inference make Bayesian approaches difficult to automate. We present PClean, a probabilistic programming language (PPL) for leveraging dataset-specific knowledge to automate Bayesian cleaning. Compared to general-purpose PPLs, PClean tackles a restricted problem domain, enabling three modeling and inference innovations: (1) a non-parametric model of relational database instances, which users' programs customize; (2) a novel sequential Monte Carlo inference algorithm that exploits the structure of PClean's model class; and (3) a compiler that generates near-optimal SMC proposals and blocked-Gibbs rejuvenation kernels based on the user's model and data. We show empirically that short (< 50-line) PClean programs can: be faster and more accurate than generic PPL inference on data-cleaning benchmarks; match state-of-the-art data-cleaning systems in terms of accuracy and runtime (unlike generic PPL inference in the same runtime); and scale to real-world datasets with millions of records.
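To make the framing in the first sentence concrete, here is a minimal Python sketch of data cleaning as Bayesian inference over a single field. It is not PClean code and does not use PClean's SMC or blocked-Gibbs machinery: it simply enumerates a small, hand-written prior over clean values and scores each against the observed (possibly corrupted) string. The prior, the `typo_rate` parameter, the unnormalized edit-distance likelihood, and all function names are assumptions made for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def posterior_over_clean_values(observed, prior, typo_rate=0.1):
    """Return P(clean | observed), proportional to P(clean) * P(observed | clean).

    The error model is a toy assumption: an exact match has weight
    (1 - typo_rate), and each edit multiplies the weight by typo_rate.
    """
    weights = {}
    for clean, p in prior.items():
        d = edit_distance(observed, clean)
        likelihood = (1 - typo_rate) if d == 0 else typo_rate ** d
        weights[clean] = p * likelihood
    total = sum(weights.values())
    return {clean: w / total for clean, w in weights.items()}


def most_likely_clean_value(observed, prior, typo_rate=0.1):
    """Pick the clean value with the highest posterior probability."""
    posterior = posterior_over_clean_values(observed, prior, typo_rate)
    return max(posterior, key=posterior.get)


if __name__ == "__main__":
    # Hypothetical prior over the true values of a "city" column.
    city_prior = {"Boston": 0.6, "Austin": 0.4}
    print(most_likely_clean_value("Bostom", city_prior))
    print(posterior_over_clean_values("Bostom", city_prior))
```

Exact enumeration like this only works when the set of candidate clean values is tiny. The abstract's point is that realistic relational models need specialized inference, which is where the sequential Monte Carlo algorithm and compiled proposals come in.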