Paper Title
Multi-layer Optimizations for End-to-End Data Analytics
Authors
Abstract
We consider the problem of training machine learning models over multi-relational data. The mainstream approach is to first construct the training dataset using a feature extraction query over the input database and then train the model with a statistical software package of choice. In this paper we introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach. IFAQ treats the feature extraction query and the learning task as one program written in IFAQ's domain-specific language, which captures a subset of Python commonly used in Jupyter notebooks for rapid prototyping of machine learning applications. The program is subject to several layers of IFAQ optimizations, such as algebraic transformations, loop transformations, schema specialization, and data layout optimizations, and is finally compiled into efficient low-level C++ code specialized for the given workload and data. We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and TensorFlow by several orders of magnitude for linear regression and regression tree models over several relational datasets.
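To make the contrast with the mainstream approach concrete, the following is a minimal Python sketch of one algebraic transformation in this spirit: pushing a sum-of-products aggregate past a join, so the joined training dataset is never materialized. Sums of products of features are the kind of aggregate that least-squares linear regression needs. The relations `sales` and `items`, their contents, and both function names are hypothetical illustrations. This is not IFAQ's domain-specific language, its optimizer, or its generated code.

```python
from collections import defaultdict

# Hypothetical toy relations, not taken from the paper.
# sales(item, units) and items(item, price) join on `item`.
sales = [("a", 2.0), ("a", 3.0), ("b", 1.0), ("c", 4.0)]
items = [("a", 10.0), ("b", 20.0), ("b", 30.0)]


def aggregate_after_join(sales, items):
    """Mainstream approach: materialize the join, then aggregate over it."""
    joined = [(units, price)
              for (s_item, units) in sales
              for (i_item, price) in items
              if s_item == i_item]
    return sum(units * price for units, price in joined)


def aggregate_pushed_past_join(sales, items):
    """Pre-aggregate each relation per join key, then combine partial sums."""
    units_by_item = defaultdict(float)
    for item, units in sales:
        units_by_item[item] += units

    price_by_item = defaultdict(float)
    for item, price in items:
        price_by_item[item] += price

    # Only keys present in both relations contribute to the join.
    shared_keys = units_by_item.keys() & price_by_item.keys()
    return sum(units_by_item[k] * price_by_item[k] for k in shared_keys)


if __name__ == "__main__":
    print(aggregate_after_join(sales, items))
    print(aggregate_pushed_past_join(sales, items))
```

Both functions compute the same value because multiplication distributes over addition within each join key. The second one scans each relation once and never builds the join result, which can be much larger than either input.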