Paper Title

From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels

Authors

Gregor Daiß, Patrick Diehl, Dominic Marcello, Alireza Kheirkhahan, Hartmut Kaiser, Dirk Pflüger

Abstract

Meeting both scalability and performance portability requirements is a challenge for any HPC application, especially for adaptively refined ones. In Octo-Tiger, an astrophysics application for the simulation of stellar mergers, we approach this with existing solutions: We employ HPX to obtain fine-grained tasks to easily distribute work and finely overlap communication and computation. For the computations themselves, we use Kokkos to turn these tasks into compute kernels capable of running on hardware ranging from a few CPU cores to powerful accelerators. There is a missing link, however: while the fine-grained parallelism exposed by HPX is useful for scalability, it can hinder GPU performance when the tasks become too small to saturate the device, causing low resource utilization. To bridge this gap, we investigate multiple different GPU work aggregation strategies within Octo-Tiger, adding one new strategy, and evaluate the node-level performance impact on recent AMD and NVIDIA GPUs, achieving noticeable speedups.
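
The idea of work aggregation mentioned in the abstract can be illustrated with a minimal sketch. It is not Octo-Tiger's implementation and uses no HPX or Kokkos API: the class `WorkAggregator`, the function `batched_sum_kernel`, and the parameter `max_batch` are hypothetical names chosen for illustration, and an ordinary Python function stands in for a GPU kernel launch. Under those assumptions, it shows how buffering many small tasks and processing them in one call reduces the number of launches.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class WorkAggregator:
    """Buffers small tasks and hands them to `launch` as one batch.

    Hypothetical illustration: `launch` stands in for a single GPU kernel
    launch. It is not an HPX, Kokkos, or Octo-Tiger interface.
    """
    launch: Callable[[List[List[float]]], List[float]]
    max_batch: int = 4
    pending: List[List[float]] = field(default_factory=list)
    results: List[float] = field(default_factory=list)
    launches: int = 0

    def submit(self, task_data: List[float]) -> None:
        # Queue the task; launch only once enough work has accumulated.
        self.pending.append(task_data)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        # Process whatever is buffered in one aggregated launch.
        if not self.pending:
            return
        self.results.extend(self.launch(self.pending))
        self.launches += 1
        self.pending = []


def batched_sum_kernel(batch: List[List[float]]) -> List[float]:
    # Stand-in for one kernel that handles every aggregated task at once.
    return [sum(chunk) for chunk in batch]


def run(num_tasks: int, max_batch: int):
    agg = WorkAggregator(launch=batched_sum_kernel, max_batch=max_batch)
    for i in range(num_tasks):
        agg.submit([float(i), 1.0])
    agg.flush()  # launch any leftover partial batch
    return agg.launches, agg.results
```

In this sketch, `run(10, 1)` issues one launch per task, while `run(10, 4)` groups the same ten tasks into three launches and returns the same results in the same order. Real aggregation strategies also have to decide how long to wait for more work before launching, which this sketch does not model.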

Paper Title