Paper Title

DBLog: A Watermark Based Change-Data-Capture Framework

Authors

Andreas Andreakis, Ioannis Papapanagiotou

Abstract

Applications commonly use multiple heterogeneous databases, each serving a specific need such as storing the canonical form of data or providing advanced search capabilities. It is therefore desirable for applications to keep multiple databases in sync. We have observed a series of distinct patterns that try to solve this problem, such as dual-writes and distributed transactions. However, these approaches have limitations with regard to feasibility, robustness, and maintenance. An alternative approach that has recently emerged is to use Change-Data-Capture (CDC) to capture changed rows from a database's transaction log and eventually deliver them downstream with low latency. Solving the data synchronization problem also requires replicating the full state of a database, yet transaction logs typically do not contain the full history of changes. At the same time, there are use cases that require high availability of the transaction log events so that databases stay as closely in sync as possible. To address these challenges, we developed a novel CDC framework for databases, namely DBLog. DBLog uses a watermark-based approach that allows us to interleave transaction log events with rows that we directly select from tables to capture the full state. Our solution allows log events to continue to make progress without stalling while selects are processed. Selects can be triggered at any time on all tables, on a specific table, or for specific primary keys of a table. DBLog executes selects in chunks and tracks progress, allowing them to be paused and resumed. The watermark approach does not use locks and has minimal impact on the source. DBLog is currently used in production by tens of microservices at Netflix.
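The interleaving idea in the abstract can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not DBLog's implementation: it assumes the low and high watermarks appear as marker events in the log stream around the point where a chunk was selected, and that a selected row is discarded when a log event for the same primary key falls between the two watermarks (since the log event is at least as recent). The names `LogEvent` and `interleave_chunk` and the watermark ids `"L"` and `"H"` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class LogEvent:
    kind: str          # "change" or "watermark" (hypothetical event kinds)
    key: Any = None    # primary key for a change, or the watermark id
    value: Any = None  # new row value for a change


def interleave_chunk(log_events, chunk_rows, low, high):
    """Merge one selected chunk into an ordered stream of log events.

    Assumption: the chunk in `chunk_rows` (pk -> row) was selected at some
    point between the `low` and `high` watermark events in the log.
    """
    output = []
    pending = dict(chunk_rows)  # selected rows not yet superseded by the log
    in_window = False
    for ev in log_events:
        if ev.kind == "watermark" and ev.key == low:
            in_window = True
            continue
        if ev.kind == "watermark" and ev.key == high:
            # Emit surviving selected rows before any later log event.
            output.extend(("select", k, v) for k, v in pending.items())
            pending.clear()
            in_window = False
            continue
        if in_window and ev.kind == "change":
            # The log holds a change for this key inside the window,
            # so the selected copy may be stale: drop it.
            pending.pop(ev.key, None)
        output.append(("log", ev.key, ev.value))
    return output


log = [
    LogEvent("change", 1, "a1"),
    LogEvent("watermark", "L"),
    LogEvent("change", 2, "b2"),
    LogEvent("watermark", "H"),
    LogEvent("change", 3, "c2"),
]
chunk = {1: "a1", 2: "b1", 3: "c1"}
result = interleave_chunk(log, chunk, low="L", high="H")
```

In this sketch the log events are never blocked: they pass straight through, and the selected rows are slotted in at the high watermark. The selected copy of key 2 is dropped because the log already carries a change for it inside the window, while the later change to key 3 is still delivered after its selected row.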
