Paper Title

Optimizing Federated Queries Based on the Physical Design of a Data Lake

Authors

Philipp D. Rohde, Maria-Esther Vidal

Abstract

The optimization of query execution plans is known to be crucial for reducing query execution time. In particular, query optimization has been studied thoroughly for relational databases over the past decades. Recently, the Resource Description Framework (RDF) became popular for publishing data on the Web. As a consequence, federations composed of different data models, such as RDF and relational databases, have evolved. One type of such federation is the Semantic Data Lake, in which every data source is kept in its original data model and semantically annotated with ontologies or controlled vocabularies. However, state-of-the-art query engines for federated query processing over Semantic Data Lakes often rely on optimization techniques tailored for RDF. In this paper, we present query optimization techniques guided by heuristics that take the physical design of a Data Lake into account. The heuristics are implemented on top of Ontario, a SPARQL query engine for Semantic Data Lakes. Using source-specific heuristics, the query engine is able to generate more efficient query execution plans by exploiting knowledge about indexes and normalization in relational databases. We show that heuristics that take the physical design of the Data Lake into account are able to speed up query processing.
