DPU-v2: Energy-efficient execution of irregular directed acyclic graphs

Paper title

DPU-v2: Energy-efficient execution of irregular directed acyclic graphs

Authors

Nimish Shah, Wannes Meert, Marian Verhelst

Abstract

A growing number of applications like probabilistic machine learning, sparse linear algebra, robotic navigation, etc., exhibit irregular data flow computation that can be modeled with directed acyclic graphs (DAGs). The irregularity arises from the seemingly random connections of nodes, which makes the DAG structure unsuitable for vectorization on CPU or GPU. Moreover, the nodes usually represent a small number of arithmetic operations that cannot amortize the overhead of launching tasks/kernels for each node, further posing challenges for parallel execution. To enable energy-efficient execution, this work proposes DAG processing unit (DPU) version 2, a specialized processor architecture optimized for irregular DAGs with static connectivity. It consists of a tree-structured datapath for efficient data reuse, a customized banked register file, and interconnects tuned to support irregular register accesses. DPU-v2 is utilized effectively through a targeted compiler that systematically maps operations to the datapath, minimizes register bank conflicts, and avoids pipeline hazards. Finally, a design space exploration identifies the optimal architecture configuration that minimizes the energy-delay product. This hardware-software co-optimization approach results in a speedup of 1.4$\times$, 3.5$\times$, and 14$\times$ over a state-of-the-art DAG processor ASIP, a CPU, and a GPU, respectively, while also achieving a lower energy-delay product. In this way, this work takes an important step toward enabling an embedded execution of emerging DAG workloads.
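To make the notion of an irregular DAG with static connectivity concrete, the following is a minimal Python sketch. It is not the DPU-v2 compiler or datapath: the toy graph, the `evaluate` and `levels` helpers, and the two-operation instruction set (`add`, `mul`) are all assumptions chosen for illustration. It shows that each node performs very little arithmetic while gathering its operands from arbitrary earlier nodes, which is the access pattern the abstract describes as hard to vectorize.

```python
from math import prod

# Hypothetical toy DAG: each node is (op, input_ids); leaves use op "in".
# Nodes are listed in topological order and the connectivity never changes.
DAG = [
    ("in", ()),        # node 0
    ("in", ()),        # node 1
    ("in", ()),        # node 2
    ("add", (0, 1)),   # node 3
    ("mul", (1, 2)),   # node 4
    ("mul", (3, 4)),   # node 5
    ("add", (5, 0)),   # node 6
]


def evaluate(dag, leaf_values):
    """Evaluate every node in topological order and return all node values."""
    values = []
    leaves = iter(leaf_values)
    for op, inputs in dag:
        if op == "in":
            values.append(next(leaves))
        else:
            # Irregular gather: operands come from arbitrary earlier nodes.
            args = [values[i] for i in inputs]
            values.append(sum(args) if op == "add" else prod(args))
    return values


def levels(dag):
    """Depth of each node; nodes at the same depth have no mutual dependency."""
    depth = []
    for _, inputs in dag:
        depth.append(0 if not inputs else 1 + max(depth[i] for i in inputs))
    return depth


if __name__ == "__main__":
    print(evaluate(DAG, [2.0, 3.0, 4.0]))
    print(levels(DAG))
```

Because the connectivity is fixed, a grouping such as the one computed by `levels` can be derived once, ahead of time, which is the kind of static information a compiler could in principle exploit when mapping operations to hardware.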
