Paper Title

A Taxonomy of Error Sources in HPC I/O Machine Learning Models

Authors

Mihailo Isakov, Mikaela Currier, Eliakin del Rosario, Sandeep Madireddy, Prasanna Balaprakash, Philip Carns, Robert B. Ross, Glenn K. Lockwood, Michel A. Kinsy

Abstract

I/O efficiency is crucial to productivity in scientific computing, but the increasing complexity of systems and applications makes it difficult for practitioners to understand and optimize I/O behavior at scale. Data-driven, machine learning-based I/O throughput models offer a solution: they can be used to identify bottlenecks, automate I/O tuning, or optimize job scheduling with minimal human intervention. Unfortunately, current state-of-the-art I/O models are not robust enough for production use and underperform after being deployed. We analyze multiple years of application, scheduler, and storage system logs on two leadership-class HPC platforms to understand why I/O models underperform in practice. We propose a taxonomy consisting of five categories of I/O modeling errors: poor application modeling, poor system modeling, inadequate dataset coverage, I/O contention, and I/O noise. We develop litmus tests to quantify each category, allowing researchers to narrow down failure modes, enhance I/O throughput models, and improve future generations of HPC logging and analysis tools.
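To make the "litmus test" idea concrete, here is a minimal sketch of one plausible test for the I/O-noise category: if nominally identical runs of the same job show high throughput variance, that variance is irreducible noise that bounds the accuracy any model can reach. The function name and the throughput samples below are hypothetical illustrations, not taken from the paper.

```python
from statistics import mean, stdev

def noise_litmus(throughputs):
    """Coefficient of variation (CV) of throughput across repeated,
    nominally identical runs of one job configuration.

    A high CV signals irreducible I/O noise: no model fed only with
    application/system features can predict throughput more precisely
    than this run-to-run spread.
    """
    m = mean(throughputs)
    return stdev(throughputs) / m

# Hypothetical throughput samples (MiB/s) from five repeated runs;
# one run is degraded, e.g. by transient contention on shared storage.
runs = [512.0, 498.0, 530.0, 260.0, 505.0]
print(f"CV = {noise_litmus(runs):.2f}")
```

A practical variant would group jobs by an application/configuration fingerprint extracted from the logs and flag groups whose CV exceeds a chosen threshold as noise-dominated, excluding them when attributing model error to the other taxonomy categories.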
