Paper Title
Survive the Schema Changes: Integration of Unmanaged Data Using Deep Learning
Paper Authors
Paper Abstract
Data is king in the age of AI. However, data integration is often a laborious task that is hard to automate. Schema change is one significant obstacle to automating the end-to-end data integration process. Although mechanisms such as query discovery and schema modification languages exist to handle the problem, these approaches work only under the assumption that the schema is maintained by a database. Yet we observe diverse schema changes in heterogeneous data and open data, most of which have no schema defined. In this work, we propose to use deep learning to handle schema changes automatically, through a super cell representation and automatic injection of perturbations into the training data that make the model robust to schema changes. Our experimental results demonstrate that the proposed approach is effective for two real-world data integration scenarios: coronavirus data integration and machine log integration.
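To make the idea of injecting schema perturbations into training data more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each training example is a flat key-value record with a target label, and that schema changes can be approximated by three illustrative operations: renaming an attribute, dropping an attribute, and reordering attributes. The names `SYNONYMS`, `perturb_schema`, and `augment`, as well as the `drop_prob` and `copies` parameters, are hypothetical and do not come from the paper.

```python
import random

# Hypothetical rename table; the abstract does not specify how renames are chosen.
SYNONYMS = {
    "confirmed": ["cases", "confirmed_cases"],
    "deaths": ["fatalities"],
    "region": ["province_state", "area"],
}


def perturb_schema(record, rng, drop_prob=0.2):
    """Return a copy of `record` whose schema has been randomly perturbed.

    Illustrative perturbations (assumptions, not the paper's exact set):
    dropping an attribute, renaming an attribute, and shuffling attribute order.
    """
    items = []
    for key, value in record.items():
        if rng.random() < drop_prob:
            continue  # simulate an attribute that disappeared from the source
        # Keep the original name or swap in one of its hypothetical synonyms.
        new_key = rng.choice([key] + SYNONYMS.get(key, []))
        items.append((new_key, value))
    if not items:
        # Never emit an empty record: keep one original attribute.
        key = rng.choice(list(record))
        items.append((key, record[key]))
    rng.shuffle(items)  # simulate reordered columns
    return dict(items)


def augment(records, labels, copies=3, seed=0):
    """Pair each original record with `copies` perturbed variants, same label."""
    rng = random.Random(seed)
    out = []
    for record, label in zip(records, labels):
        out.append((record, label))
        for _ in range(copies):
            out.append((perturb_schema(record, rng), label))
    return out


if __name__ == "__main__":
    row = {"region": "Hubei", "confirmed": 100, "deaths": 3}
    for rec, label in augment([row], ["covid_row"], copies=3, seed=0):
        print(label, rec)
```

Because every perturbed variant keeps the label of its source record, a model trained on the augmented set sees the same target under many schema layouts, which is the intuition behind robustness to schema changes. How the records are encoded for the model (the super cell representation) is outside the scope of this sketch.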