Paper Title
Sudowoodo: Contrastive Self-supervised Learning for Multi-purpose Data Integration and Preparation
Authors
Abstract
Machine learning (ML) is playing an increasingly important role in data management tasks, particularly in Data Integration and Preparation (DI&P). The success of ML-based approaches, however, relies heavily on the availability of large-scale, high-quality labeled datasets for each task. Moreover, the wide variety of DI&P tasks and pipelines often requires customized ML solutions, which can incur a significant cost in model engineering and experimentation. These factors inevitably hold back the adoption of ML-based approaches in new domains and tasks. In this paper, we propose Sudowoodo, a multi-purpose DI&P framework based on contrastive representation learning. Sudowoodo features a unified, matching-based problem definition that captures a wide range of DI&P tasks, including Entity Matching (EM) in data integration, error correction in data cleaning, semantic type detection in data discovery, and more. Contrastive learning enables Sudowoodo to learn similarity-aware data representations from a large corpus of data items (e.g., entity entries, table columns) without using any labels. The learned representations can later be used directly, or fine-tuned with only a few labels, to support different DI&P tasks. Our experimental results show that Sudowoodo achieves multiple state-of-the-art results at different levels of supervision and outperforms the previous best specialized blocking or matching solutions for EM. Sudowoodo also achieves promising results in data cleaning and semantic type detection tasks, showing its versatility in DI&P applications.
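To make the idea of learning similarity-aware representations without labels more concrete, the following is a minimal sketch of a generic contrastive (InfoNCE-style) objective. It is an illustration under stated assumptions, not Sudowoodo's actual implementation: the abstract does not specify the loss, the encoder, or the augmentation scheme, so the function names `cosine` and `info_nce_loss`, the `temperature` parameter, and the toy embeddings are all hypothetical. The sketch assumes that each data item has already been encoded twice (for example, from two augmented views), and it treats the other items in the batch as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(views_a, views_b, temperature=0.1):
    """Mean one-directional InfoNCE loss over a batch.

    views_a[i] and views_b[i] are assumed to be embeddings of two views of
    the same item (the positive pair); every views_b[j] with j != i acts as
    a negative for views_a[i].
    """
    n = len(views_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(views_a[i], views_b[j]) / temperature for j in range(n)]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / n


# Toy embeddings (hypothetical): two items, two views each.
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[0.9, 0.1], [0.1, 0.9]]

loss_aligned = info_nce_loss(views_a, views_b)
loss_mismatched = info_nce_loss(views_a, list(reversed(views_b)))
print(loss_aligned, loss_mismatched)
```

In this toy setup the loss is lower when the two views of each item are close to each other and far from the views of other items, which is the property that makes the resulting embeddings usable for matching-style tasks such as blocking or entity matching. A practical system would compute the embeddings with a trainable encoder and minimize this loss by gradient descent; that part is omitted here.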