Paper title
Extending MadFlow: device-specific optimization
Paper authors
Paper abstract
In these proceedings we demonstrate some advantages of a top-down approach to the development of hardware-accelerated code. We start with an autogenerated, hardware-agnostic Monte Carlo generator that is parallelized along the event axis. This allows us to take advantage of the parallelizable nature of Monte Carlo integrals even when we have no control over the hardware on which the computation will run (e.g., an external cluster). The generic nature of such an implementation can introduce spurious bottlenecks or overheads. Fortunately, these bottlenecks are usually restricted to a subset of operations rather than affecting the whole vectorized program. By identifying the most critical parts of the calculation, one can obtain very efficient code while minimizing the amount of hardware-specific code that needs to be written. We show benchmarks demonstrating how simply reducing the memory footprint of the calculation can increase the performance of a $2 \to 4$ process.
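As a rough illustration of what parallelizing along the event axis and limiting the memory footprint can mean in practice, the sketch below evaluates a toy integrand on whole batches of events at once and processes those batches in fixed-size chunks. It is not MadFlow code: the names `integrand`, `mc_integrate` and `chunk_size`, and the toy integrand itself, are assumptions made for this example, and plain NumPy stands in for whatever backend runs on the actual device.

```python
import numpy as np


def integrand(x):
    # Hypothetical stand-in for a matrix element (not a physical one):
    # f(x) = prod_i 2*x_i, whose integral over the unit hypercube is 1.
    # x has shape (n_events, n_dim); the product runs over the dimension axis,
    # so every event in the batch is evaluated in a single vectorized call.
    return np.prod(2.0 * x, axis=1)


def mc_integrate(n_events, n_dim, chunk_size, seed=0):
    # Plain Monte Carlo estimate of the integral over the unit hypercube.
    # Only chunk_size events are held in memory at any time; the running
    # sums are all that is kept between chunks.
    rng = np.random.default_rng(seed)
    total = 0.0
    total_sq = 0.0
    done = 0
    while done < n_events:
        n = min(chunk_size, n_events - done)
        x = rng.random((n, n_dim))      # one chunk of random events
        w = integrand(x)                # vectorized over the event axis
        total += w.sum()
        total_sq += np.square(w).sum()
        done += n
    mean = total / n_events
    variance = total_sq / n_events - mean**2
    return mean, np.sqrt(variance / n_events)


estimate, error = mc_integrate(n_events=200_000, n_dim=4, chunk_size=50_000)
print(f"integral = {estimate:.4f} +/- {error:.4f}")
```

In this sketch the chunk size controls peak memory without changing the statistical content of the estimate, which is one simple way a memory footprint could be reduced while the rest of the program stays hardware-agnostic.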