Paper Title

MAPLE-Edge: A Runtime Latency Predictor for Edge Devices

Paper Authors

Saeejith Nair, Saad Abbasi, Alexander Wong, Mohammad Javad Shafiee

Abstract

Neural Architecture Search (NAS) has enabled automatic discovery of more efficient neural network architectures, especially for mobile and embedded vision applications. Although recent research has proposed ways of quickly estimating latency on unseen hardware devices with just a few samples, little focus has been given to the challenges of estimating latency on runtimes using optimized graphs, such as TensorRT and specifically for edge devices. In this work, we propose MAPLE-Edge, an edge device-oriented extension of MAPLE, the state-of-the-art latency predictor for general purpose hardware, where we train a regression network on architecture-latency pairs in conjunction with a hardware-runtime descriptor to effectively estimate latency on a diverse pool of edge devices. Compared to MAPLE, MAPLE-Edge can describe the runtime and target device platform using a much smaller set of CPU performance counters that are widely available on all Linux kernels, while still achieving up to +49.6% accuracy gains against previous state-of-the-art baseline methods on optimized edge device runtimes, using just 10 measurements from an unseen target device. We also demonstrate that unlike MAPLE which performs best when trained on a pool of devices sharing a common runtime, MAPLE-Edge can effectively generalize across runtimes by applying a trick of normalizing performance counters by the operator latency, in the measured hardware-runtime descriptor. Lastly, we show that for runtimes exhibiting lower than desired accuracy, performance can be boosted by collecting additional samples from the target device, with an extra 90 samples translating to gains of nearly +40%.
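The normalization trick mentioned in the abstract can be illustrated with a minimal sketch: raw CPU performance counters collected while benchmarking a few reference operators are divided by each operator's measured latency, turning absolute counts into rate-like features that are comparable across runtimes. All function names, counter names, and numbers below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of latency-normalizing performance counters to build a
# hardware-runtime descriptor. Counter values and latencies are made up.

def normalize_descriptor(counters, latencies):
    """Divide each raw counter by the operator's latency (count -> count/sec).

    counters:  {op_name: {counter_name: raw_count}}
    latencies: {op_name: latency_in_seconds}
    """
    descriptor = {}
    for op, counts in counters.items():
        lat = latencies[op]
        descriptor[op] = {name: value / lat for name, value in counts.items()}
    return descriptor

# Illustrative measurements for two reference operators on some device.
counters = {
    "conv3x3": {"cycles": 2.0e9, "cache-misses": 1.0e6},
    "dwconv":  {"cycles": 5.0e8, "cache-misses": 4.0e5},
}
latencies = {"conv3x3": 0.020, "dwconv": 0.008}  # seconds

desc = normalize_descriptor(counters, latencies)
print(desc["conv3x3"]["cycles"])  # cycles per second of operator runtime
```

A descriptor built this way could then be concatenated with an architecture encoding and fed to the regression network; the per-latency normalization is what lets devices with different runtimes share a common feature scale.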
