Paper Title
Valentine: Evaluating Matching Techniques for Dataset Discovery
Paper Authors
Paper Abstract
Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use. Nowadays schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method's success relies highly on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion due to the lack of openly-available datasets with ground truth, reference method implementations, and evaluation metrics. In this paper, we aim to rectify the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery. To this end, we propose Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories. The contributions of Valentine are: i) the definition of four schema matching scenarios as encountered in dataset discovery methods, ii) a principled dataset fabrication process tailored to the scope of dataset discovery methods and iii) the most comprehensive evaluation of schema matching techniques to date, offering insight on the strengths and weaknesses of existing techniques, that can serve as a guide for employing schema matching in future dataset discovery methods.
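To make the abstract's notion of schema matching concrete, here is a minimal sketch of scoring and ranking column pairs between a source and a target table. It is an illustration only, not Valentine's implementation or API: the names `jaccard` and `rank_column_matches`, the toy tables, and the choice of Jaccard similarity over column values are all assumptions made for this example.

```python
from itertools import product


def jaccard(a, b):
    """Jaccard similarity of two collections, compared as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def rank_column_matches(source, target):
    """Score every (source column, target column) pair and rank them.

    `source` and `target` map column names to lists of values
    (a hypothetical, simplified stand-in for tabular data).
    """
    scored = [
        ((s_col, t_col), jaccard(s_vals, t_vals))
        for (s_col, s_vals), (t_col, t_vals) in product(source.items(), target.items())
    ]
    # Highest-scoring candidate correspondences come first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy tables, invented for illustration.
source = {"country": ["NL", "DE", "FR"], "year": [2019, 2020, 2021]}
target = {"nation": ["NL", "DE", "BE"], "yr": [2019, 2020, 2021]}

ranked = rank_column_matches(source, target)
for (s_col, t_col), score in ranked:
    print(f"{s_col} ~ {t_col}: {score:.2f}")
```

The output is a ranked list of candidate correspondences rather than a single one-to-one mapping, which mirrors the abstract's point that discovery methods use matching to indicate and rank relationships between datasets.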