Paper title

METL: a modern ETL pipeline with a dynamic mapping matrix

Authors

Christian Haase, Timo Röseler, Mattias Seidel

Abstract

Modern ETL streaming pipelines extract data from various sources and forward it to multiple consumers, such as data warehouses (DW) and analytical systems that leverage machine learning (ML). However, the growing number of systems connected to such pipelines requires new solutions for data integration. The canonical (or common) data model (CDM) offers such an integration. It is particularly useful for integrating microservice systems into ETL pipelines (Villaca et al. 2020; Oliveira et al. 2019). However, mapping to a CDM is complex (Lemcke et al. 2012). There are three complexity problems: the size of the required mapping matrix, the automation of matrix updates in response to changes in the extraction sources, and the time efficiency of the mapping. In this paper, we present a new solution to these problems. More precisely, we present a new dynamic mapping matrix (DMM), which is based on permutation matrices that are obtained by block-partitioning the full mapping matrix. We show that the DMM can be used for automated updates in response to schema changes, for parallel computation in near real-time, and for highly efficient compacting. For the solution, we draw on research into matrix partitioning (Quinn 2004) and dynamic networks (Haase et al. 2021). The DMM has been implemented in an app called Message ETL (METL). METL is the key part of a new ETL streaming pipeline at EOS that performs the transformation to a CDM. The ETL pipeline is based on Kafka Streams. It extracts data from more than 80 microservices using log-based Change Data Capture (CDC) with Debezium and loads the data into a DW and an ML platform. EOS is part of the Otto Group, the second-largest e-commerce provider in Europe.
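
The abstract does not give the DMM's construction, so the following is only a rough sketch of the general idea it names: one block of a full mapping matrix, treated as a 0/1 matrix with at most one 1 per row, can be stored compactly as a list of column indices and applied to a message. The attribute names, the block layout (rows as CDM attributes, columns as source attributes), and the helper functions `compact_block` and `apply_block` are all hypothetical and are not taken from the paper or from METL.

```python
from typing import List, Optional, Sequence


def compact_block(block: Sequence[Sequence[int]]) -> List[Optional[int]]:
    """Compress a 0/1 block into one source-column index per row.

    Assumption: rows are CDM attributes, columns are source attributes,
    and each row holds at most one 1 (a partial permutation).
    A row of zeros (no source for that CDM attribute) becomes None.
    """
    compact: List[Optional[int]] = []
    for row in block:
        ones = [j for j, value in enumerate(row) if value == 1]
        if len(ones) > 1:
            raise ValueError("row maps to more than one source attribute")
        compact.append(ones[0] if ones else None)
    return compact


def apply_block(compact: Sequence[Optional[int]],
                source_values: Sequence[object]) -> List[object]:
    """Reorder source values into CDM order using the compact block."""
    return [source_values[j] if j is not None else None for j in compact]


# Hypothetical source schema and CDM attributes for a single block.
source_attrs = ["cust_id", "fname", "lname"]
cdm_attrs = ["customer.last_name", "customer.id"]

# Full 0/1 block: row i has a 1 in column j when CDM attribute i
# is filled from source attribute j.
block = [
    [0, 0, 1],  # customer.last_name <- lname
    [1, 0, 0],  # customer.id        <- cust_id
]

compact = compact_block(block)
message = ["42", "Ada", "Lovelace"]  # values in source_attrs order
cdm_record = dict(zip(cdm_attrs, apply_block(compact, message)))
```

Under these assumptions, each block needs one integer per CDM attribute instead of a full row of zeros and ones, and blocks that belong to different source schemas can be processed independently of one another.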
