基于顶点剪切的框架，用于多核系统中的负载平衡和并行性优化

论文标题

基于顶点剪切的框架，用于多核系统中的负载平衡和并行性优化

A Vertex Cut based Framework for Load Balancing and Parallelism Optimization in Multi-core Systems

论文作者

Ma, Guixiang, Xiao, Yao, Willke, Theodore L., Ahmed, Nesreen K., Nazarian, Shahin, Bogdan, Paul

论文摘要

高级应用程序（例如机器学习）正在从基于多层感知器的简单模型中发展，以简单识别到更深入且更复杂的神经网络，以进行自动驾驶汽车控制系统。这些模型的快速增加记忆和计算资源的消耗需要多核平行系统的使用来使用多核平行系统，以扩展其依赖于其依赖于其附近的复杂应用程序的执行。但是，在高性能计算机上运行的并行程序通常会由于不规则的关键部分而受到数据通信瓶颈，有限的内存带宽和同步开销。在本文中，我们提出了一个框架，以减少数据通信并改善多核系统中这些应用程序的可扩展性和性能。我们设计了一个顶点切割框架，用于将LLVM IR图将其划分为簇，同时考虑群集之间的数据通信和工作负载平衡。首先，我们通过将高级程序编译到LLVM IR，仪器代码以获取基本块的执行顺序和每个内存操作的执行时间，并分析动态LLVM痕迹中的数据依赖项，从而构建LLVM图。接下来，我们将问题提出为平衡的$ p $ -way顶点切割，并提出了一个通用且灵活的框架，其中提出了四种不同的贪婪算法来解决此问题。最后，我们提出了一个以内存为中心的线性时间复杂性的以内存为中心的运行时间映射，以映射从顶点切割算法生成的群集到多核平台上。我们得出的结论是，我们最好的算法WB-LIBRA分别在8和1024个群集上分别在多核平台上运行的8和1024群集提供了1.56倍和1.86倍的性能改进。

High-level applications, such as machine learning, are evolving from simple models based on multilayer perceptrons for simple image recognition to much deeper and more complex neural networks for self-driving vehicle control systems.The rapid increase in the consumption of memory and computational resources by these models demands the use of multi-core parallel systems to scale the execution of the complex emerging applications that depend on them. However, parallel programs running on high-performance computers often suffer from data communication bottlenecks, limited memory bandwidth, and synchronization overhead due to irregular critical sections. In this paper, we propose a framework to reduce the data communication and improve the scalability and performance of these applications in multi-core systems. We design a vertex cut framework for partitioning LLVM IR graphs into clusters while taking into consideration the data communication and workload balance among clusters. First, we construct LLVM graphs by compiling high-level programs into LLVM IR, instrumenting code to obtain the execution order of basic blocks and the execution time for each memory operation, and analyze data dependencies in dynamic LLVM traces. Next, we formulate the problem as Weight Balanced $p$-way Vertex Cut, and propose a generic and flexible framework, wherein four different greedy algorithms are proposed for solving this problem. Lastly, we propose a memory-centric run-time mapping of the linear time complexity to map clusters generated from the vertex cut algorithms onto a multi-core platform. We conclude that our best algorithm, WB-Libra, provides performance improvements of 1.56x and 1.86x over existing state-of-the-art approaches for 8 and 1024 clusters running on a multi-core platform, respectively.

下载PDF全文

下载文献需遵守相关版权规定

论文标题