Paper Title

Scaling Up Knowledge Graph Creation to Large and Heterogeneous Data Sources

Authors

Enrique Iglesias, Samaneh Jozashoori, Maria-Esther Vidal

Abstract

RDF knowledge graphs (KG) are powerful data structures to represent factual statements created from heterogeneous data sources. KG creation is laborious and demands data management techniques to be executed efficiently. This paper tackles the problem of the automatic generation of KG creation processes declaratively specified; it proposes techniques for planning and transforming heterogeneous data into RDF triples following mapping assertions specified in the RDF Mapping Language (RML). Given a set of mapping assertions, the planner provides an optimized execution plan by partitioning and scheduling the execution of the assertions. First, the planner assesses an optimized number of partitions considering the number of data sources, type of mapping assertions, and the associations between different assertions. After providing a list of partitions and assertions that belong to each partition, the planner determines their execution order. A greedy algorithm is implemented to generate the partitions' bushy tree execution plan. Bushy tree plans are translated into operating system commands that guide the execution of the partitions of the mapping assertions in the order indicated by the bushy tree. The proposed optimization approach is evaluated over state-of-the-art RML-compliant engines, and existing benchmarks of data sources and RML triples maps. Our experimental results suggest that the performance of the studied engines can be considerably improved, particularly in a complex setting with numerous triples maps and large data sources. As a result, engines that time out in complex cases are enabled to produce at least a portion of the KG applying the planner.
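
The two planning steps named in the abstract, partitioning the mapping assertions and then greedily ordering the partitions into a bushy tree, can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not give its details, so the sketch assumes that assertions linked by a join belong in the same partition (connected components) and that the greedy step always combines the two cheapest subplans. The names `partition_assertions`, `greedy_bushy_plan`, the cost function, and the example triples maps are all hypothetical.

```python
import heapq
from itertools import count


def partition_assertions(names, joins):
    """Group mapping assertions into partitions.

    Assumption (not from the paper): assertions connected by a join
    reference must run together, so each partition is one connected
    component of the join graph.
    """
    parent = {name: name for name in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in joins:
        parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


def greedy_bushy_plan(partitions, cost):
    """Build a bushy tree by repeatedly pairing the two cheapest subplans.

    Leaves are ("leaf", names); inner nodes are ("node", left, right).
    The pairing rule is an illustrative greedy heuristic only.
    """
    if not partitions:
        raise ValueError("at least one partition is required")
    tie = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(cost(p), next(tie), ("leaf", tuple(p))) for p in partitions]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next(tie), ("node", left, right)))
    return heap[0][2]


# Hypothetical triples maps; only Mutation and Gene are joined.
names = ["Gene", "Sample", "Mutation", "Drug"]
joins = [("Mutation", "Gene")]

partitions = partition_assertions(names, joins)
plan = greedy_bushy_plan(partitions, cost=len)
print(partitions)
print(plan)
```

Using the number of assertions as the cost is a placeholder; a real planner would presumably weigh data source size and assertion type, as the abstract indicates. Translating the resulting tree into operating system commands is left out because it depends on the command-line interface of the chosen RML engine.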
