Paper Title
From Merging Frameworks to Merging Stars: Experiences using HPX, Kokkos and SIMD Types
Authors
Abstract
Octo-Tiger, a large-scale 3D AMR code for simulating stellar mergers, combines HPX, Kokkos, and explicit SIMD types to achieve performance portability across a broad range of heterogeneous hardware. On A64FX CPUs, however, we encountered several missing pieces that caused problems with SIMD vectorization and thereby hindered performance. We therefore add std::experimental::simd as an option for Octo-Tiger's Kokkos kernels alongside Kokkos SIMD, and further add a new SVE (Scalable Vector Extension) SIMD backend. Additionally, we add the SIMD implementations that were missing from the Kokkos kernels in Octo-Tiger's hydro solver. We test our changes by running Octo-Tiger on three different CPUs: an A64FX, an Intel Ice Lake, and an AMD EPYC CPU, evaluating SIMD speedup and node-level performance. We obtain a good SIMD speedup on the A64FX CPU, as well as noticeable speedups on the other two CPU platforms. However, we also observe a scaling issue on the EPYC CPU.
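To illustrate what making the SIMD type a selectable option can look like, the following is a minimal sketch of a kernel templated on an explicit SIMD type from std::experimental::simd. It is not taken from Octo-Tiger: the function name `axpy`, its parameters, and the aliases `scalar_t` and `native_t` are hypothetical and chosen only for illustration, and the sketch assumes a standard library that ships `<experimental/simd>` (for example, a recent libstdc++).

```cpp
#include <cstddef>
#include <experimental/simd>
#include <vector>

namespace stdx = std::experimental;

// Hypothetical aliases: a one-lane (scalar) type and the widest type
// the compilation target supports natively.
using scalar_t = stdx::simd<double, stdx::simd_abi::scalar>;
using native_t = stdx::native_simd<double>;

// Hypothetical kernel, templated on the SIMD type: y[i] = a * x[i] + y[i].
// The same source compiles to scalar or vectorized code depending on Simd.
template <typename Simd>
void axpy(double a, const double* x, double* y, std::size_t n) {
    constexpr std::size_t width = Simd::size();
    const Simd av(a);  // broadcast the scalar to all lanes
    std::size_t i = 0;
    // Main loop: process `width` elements per iteration.
    for (; i + width <= n; i += width) {
        const Simd xv(x + i, stdx::element_aligned);
        Simd yv(y + i, stdx::element_aligned);
        yv = av * xv + yv;
        yv.copy_to(y + i, stdx::element_aligned);
    }
    // Remainder loop: handle the tail that does not fill a full vector.
    for (; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Calling `axpy<native_t>(a, x.data(), y.data(), n)` uses the native vector width, while `axpy<scalar_t>(...)` runs the same kernel one element at a time; comparing the two is one way to measure the kind of SIMD speedup the abstract refers to.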