Paper Title

Collective Vector Clocks: Low-Overhead Transparent Checkpointing for MPI

Authors

Yao Xu, Gene Cooperman

Abstract

Taking snapshots of the state of a distributed computation is useful for off-line analysis of the computational state, for later restarting from the saved snapshot, for cloning a copy of the computation, and for migration to a new cluster. The problem becomes more difficult when supporting collective operations across processes, such as barrier, reduce, scatter, and gather. Some processes may have reached the barrier or other collective operation, while other processes take a long time to reach that same barrier or collective operation. At least two solutions are well known in the literature: (i) draining in-flight network messages and then freezing the network at checkpoint time; and (ii) adding a barrier prior to the collective operation, and either completing the operation or aborting the barrier if not all processes are present. Both solutions suffer from important drawbacks. The code in the first solution must be updated whenever one ports to a newer network. The second solution implies additional barrier-related network traffic prior to each collective operation. This work presents a third solution that avoids both drawbacks: there is no additional barrier-related traffic, and the solution is implemented entirely above the network layer. The work is demonstrated in the context of transparent checkpointing of MPI libraries for parallel computation, where each of the first two solutions has already been used in prior systems and then abandoned due to the aforementioned drawbacks. Experiments demonstrate the low runtime overhead of this new, network-agnostic approach. The approach is also extended to non-blocking collective operations in order to handle overlapping of computation and communication.
