Paper Title
STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining
Paper Authors
Paper Abstract
Natural Language Processing (NLP) inference is seeing increasing adoption by mobile applications, where on-device inference is desirable because it preserves user data privacy and avoids network round trips. Yet the unprecedented size of an NLP model stresses both latency and memory, creating a tension between the two key resources of a mobile device. To meet a target latency, holding the whole model in memory launches execution as soon as possible but increases the app's memory footprint severalfold, limiting its benefits to only a few inferences before the memory is reclaimed by mobile memory management. On the other hand, loading the model from storage on demand incurs IO delays as long as a few seconds, far exceeding the delay range acceptable to users; pipelining layerwise model loading and execution does not hide the IO either, due to the high skew between IO and computation delays. To this end, we propose Speedy Transformer Inference (STI). Built on the key idea of maximizing IO/compute resource utilization on the most important parts of a model, STI reconciles the latency vs. memory tension via two novel techniques. First, model sharding: STI manages model parameters as independently tunable shards and profiles their importance to accuracy. Second, elastic pipeline planning with a preload buffer: STI instantiates an IO/compute pipeline and uses a small buffer of preloaded shards to bootstrap execution without stalling at early stages; it judiciously selects, tunes, and assembles shards according to their importance for resource-elastic execution, maximizing inference accuracy. Atop two commodity SoCs, we build STI and evaluate it on a wide range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU. We demonstrate that STI delivers high accuracy with 1-2 orders of magnitude less memory, outperforming competitive baselines.
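To make the elastic-pipeline idea concrete, below is a minimal conceptual sketch (not the authors' implementation) of an IO/compute pipeline bootstrapped by a small preload buffer: a handful of resident shards let the compute thread start immediately while the remaining shards stream from storage in the background. All names (load_shard, compute_layer, PRELOADED) and the delay constants are hypothetical stand-ins.

```python
# Conceptual sketch of elastic IO/compute pipelining with a preload buffer.
# Assumptions: per-layer shards, simulated IO/compute delays, hypothetical names.
import threading
import queue
import time

NUM_LAYERS = 12
PRELOADED = {0, 1}            # shards kept resident in the small preload buffer


def load_shard(layer_id):
    """Stand-in for reading one layer's shard from storage."""
    time.sleep(0.05)          # simulated IO delay
    return f"weights[{layer_id}]"


def compute_layer(layer_id, weights, x):
    """Stand-in for executing one transformer layer."""
    time.sleep(0.01)          # simulated compute delay
    return x + 1


def io_worker(out_q):
    """Streams shards in execution order; preloaded shards skip storage IO."""
    for layer in range(NUM_LAYERS):
        if layer in PRELOADED:
            out_q.put((layer, f"preloaded[{layer}]"))   # already in memory
        else:
            out_q.put((layer, load_shard(layer)))       # fetched under earlier layers' compute


def run_pipeline(x):
    q = queue.Queue(maxsize=4)                          # bounded queue caps transient memory
    threading.Thread(target=io_worker, args=(q,), daemon=True).start()
    for _ in range(NUM_LAYERS):
        layer, weights = q.get()                        # blocks only if IO falls behind compute
        x = compute_layer(layer, weights, x)
    return x


if __name__ == "__main__":
    print(run_pipeline(0))
```

In this sketch, the preloaded shards absorb the pipeline's cold start (when there is no earlier compute to overlap IO with), which mirrors the abstract's point that a small preload buffer avoids stalls at the early stages; choosing which shards to preload and at what fidelity is the importance-driven planning step described above.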