Paper Title
Application-aware Congestion Mitigation for High-Performance Computing Systems
Paper Authors
Paper Abstract
High-performance computing (HPC) systems frequently experience congestion, leading to significant application performance variation. However, the impact of congestion on application runtime differs from application to application, depending on each application's network characteristics (such as bandwidth and latency requirements). We leverage this insight to develop Netscope, an automated ML-driven framework that considers those network characteristics to dynamically mitigate congestion. We evaluate Netscope on four Cray Aries systems, including a production supercomputer, using real scientific applications. Netscope has a low training cost and accurately estimates the impact of congestion on application runtime, with a correlation between 0.7 and 0.9 for common scientific applications. Moreover, we find that Netscope reduces tail runtime variability by up to 14.9 times while improving median system utility by 12%.