A Semi-Decoupled Approach to Fast and Optimal Hardware-Software Co-Design of Neural Accelerators

Paper Title

A Semi-Decoupled Approach to Fast and Optimal Hardware-Software Co-Design of Neural Accelerators

Paper Authors

Lu, Bingqian; Yan, Zheyu; Shi, Yiyu; Ren, Shaolei

Abstract

In view of the performance limitations of fully-decoupled designs for neural architectures and accelerators, hardware-software co-design has been emerging to fully reap the benefits of flexible design spaces and optimize neural network performance. Nonetheless, such co-design also enlarges the total search space to a practically infinite size and presents substantial challenges. While prior studies have focused on improving the search efficiency (e.g., via reinforcement learning), they commonly rely on co-searches over the entire architecture-accelerator design space. In this paper, we propose a semi-decoupled approach to reduce the size of the total design space by orders of magnitude, yet without losing optimality. We first perform neural architecture search to obtain a small set of optimal architectures for one accelerator candidate. Importantly, this is also the set of (close-to-)optimal architectures for other accelerator designs, based on the property that neural architectures' ranking orders in terms of inference latency and energy consumption on different accelerator designs are highly similar. Then, instead of considering all the possible architectures, we optimize the accelerator design only in combination with this small set of architectures, thus significantly reducing the total search cost. We validate our approach by conducting experiments on various architecture spaces for accelerator designs with different dataflows. Our results highlight that we can obtain the optimal design by only navigating over the reduced search space. The source code of this work is at https://github.com/Ren-Research/CoDesign.
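
The two-stage idea in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation and does not use the code in their repository. The design space, the `accuracy` and `latency` formulas, and the names `ARCHS`, `ACCELS`, `pareto_set`, and `best_design` are all hypothetical, and the latency model is built so that architecture rankings are identical across accelerators, which is a stronger assumption than the "highly similar" rankings the abstract describes.

```python
from itertools import product

# Hypothetical toy design space: an architecture is a (depth, width) pair.
ARCHS = list(product([2, 4, 6, 8], [16, 32, 64]))
# Hypothetical accelerators: (name, time per unit of work, fixed overhead).
ACCELS = [("acc_a", 1.0, 5.0), ("acc_b", 0.6, 9.0), ("acc_c", 1.4, 2.0)]


def accuracy(arch):
    """Made-up accuracy proxy with diminishing returns (an assumption)."""
    depth, width = arch
    return 1.0 - 1.0 / (1.0 + 0.05 * depth * width ** 0.5)


def latency(arch, accel):
    """Made-up latency model: monotone in depth * width on every accelerator,
    so all accelerators rank the architectures in the same order."""
    depth, width = arch
    _, per_unit, overhead = accel
    return per_unit * depth * width / 16 + overhead


def pareto_set(archs, accel):
    """Stage 1: keep architectures that no other architecture beats on both
    accuracy (higher is better) and latency (lower is better) on one accelerator."""
    front = []
    for a in archs:
        dominated = any(
            accuracy(b) >= accuracy(a)
            and latency(b, accel) <= latency(a, accel)
            and (accuracy(b) > accuracy(a) or latency(b, accel) < latency(a, accel))
            for b in archs
        )
        if not dominated:
            front.append(a)
    return front


def best_design(archs, accels, budget):
    """Stage 2 (or full co-search): the most accurate (architecture, accelerator)
    pair whose latency fits the budget, or None if nothing fits."""
    feasible = [
        (accuracy(a), a, h) for a in archs for h in accels if latency(a, h) <= budget
    ]
    return max(feasible, key=lambda t: t[0]) if feasible else None


# Stage 1 runs once, on a single proxy accelerator.
proxy_front = pareto_set(ARCHS, ACCELS[0])
for budget in (8.0, 12.0, 20.0, 40.0):
    full = best_design(ARCHS, ACCELS, budget)           # search every pair
    reduced = best_design(proxy_front, ACCELS, budget)  # search the small set only
    print(budget, len(proxy_front), len(ARCHS), full[0] == reduced[0])
```

In this toy setting the reduced search matches the full search because any architecture outside the proxy Pareto set is also dominated on every other accelerator. With real accelerators the rankings are only approximately preserved, which is why the abstract speaks of (close-to-)optimal architectures.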
