Paper title

Principles for data analysis workflows

Authors

Stoudt, Sara; Vasquez, Valeri N.; Martinez, Ciera C.

Abstract

Traditional data science education often omits training on research workflows: the process that moves a scientific investigation from raw data to coherent research question to insightful contribution. In this paper, we elaborate basic principles of a reproducible data analysis workflow by defining three phases: the Exploratory, Refinement, and Polishing Phases. Each workflow phase is roughly centered around the audience to whom research decisions, methodologies, and results are being immediately communicated. Importantly, each phase can also give rise to a number of research products beyond traditional academic publications. Where relevant, we draw analogies between principles for data-intensive research workflows and established practice in software development. The guidance provided here is not intended to be a strict rulebook; rather, the suggestions for practices and tools to advance reproducible, sound data-intensive analysis may furnish support for both students and current professionals.
