Paper title
Fault Tolerance for Remote Memory Access Programming Models
Paper authors
Paper abstract
Remote Memory Access (RMA) is an emerging mechanism for programming high-performance computers and datacenters. However, little work exists on resilience schemes for RMA-based applications and systems. In this paper, we analyze fault tolerance for RMA and show that it is fundamentally different from resilience mechanisms targeting the message passing (MP) model. We design a model for reasoning about fault tolerance for RMA, addressing both flat and hierarchical hardware. We use this model to construct several highly scalable mechanisms that provide efficient low-overhead in-memory checkpointing, transparent logging of remote memory accesses, and a scheme for transparent recovery of failed processes. Our protocols take into account diminishing amounts of memory per core, one of the major features of future exascale machines. The implementation of our fault-tolerance scheme entails negligible additional overheads. Our reliability model shows that in-memory checkpointing and logging provide high resilience. This study enables highly scalable resilience mechanisms for RMA and fills a research gap between fault tolerance and emerging RMA programming models.