Paper Title


M$^3$ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design

Paper Authors

Hanxue Liang, Zhiwen Fan, Rishov Sarkar, Ziyu Jiang, Tianlong Chen, Kai Zou, Yu Cheng, Cong Hao, Zhangyang Wang

Paper Abstract


Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly. However, when deploying MTL onto real-world systems, which are often resource-constrained or latency-sensitive, two prominent challenges arise: (i) during training, simultaneously optimizing all tasks is often difficult due to gradient conflicts across tasks; (ii) at inference, current MTL regimes have to activate nearly the entire model even to execute just a single task. Yet most real systems demand only one or two tasks at any moment and switch between tasks as needed; such all-tasks-activated inference is therefore highly inefficient and non-scalable. In this paper, we present a model-accelerator co-design framework to enable efficient on-device MTL. Our framework, dubbed M$^3$ViT, customizes mixture-of-experts (MoE) layers into a vision transformer (ViT) backbone for MTL, and sparsely activates task-specific experts during training. At inference on any task of interest, the same design allows activating only the expert pathway corresponding to that task, instead of the full model. Our new model design is further enhanced by hardware-level innovations, in particular a novel computation-reordering scheme tailored for memory-constrained MTL that achieves zero-overhead switching between tasks and can scale to any number of experts. When executing single-task inference, M$^3$ViT achieves higher accuracies than encoder-focused MTL methods while reducing inference FLOPs by 88%. When implemented on a hardware platform of one Xilinx ZCU104 FPGA, our co-design framework reduces the memory requirement by 2.4 times, while achieving energy efficiency up to 9.23 times higher than a comparable FPGA baseline. Code is available at: https://github.com/VITA-Group/M3ViT.
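To make the core idea concrete, below is a minimal sketch (not the authors' code; see the repository above for the actual implementation) of a task-conditioned MoE feed-forward layer of the kind the abstract describes: each expert is a small MLP replacing the ViT FFN, a task-specific gate selects a sparse top-k subset of experts, and only those experts are executed for the active task. The class name `TaskMoEFFN`, the embedding-based gate, and all hyperparameters here are illustrative assumptions.

```python
# Hypothetical sketch of task-conditioned sparse MoE routing (PyTorch).
# Assumption: one gating vector per task; the paper's actual router may differ.
import torch
import torch.nn as nn


class TaskMoEFFN(nn.Module):
    def __init__(self, dim=384, hidden_dim=1536, num_experts=16, num_tasks=4, top_k=4):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small two-layer MLP (stand-in for the ViT FFN).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )
        # One learned gating vector per task: which experts fire depends only
        # on the active task, so task switching just swaps the sparse pathway.
        self.task_gate = nn.Embedding(num_tasks, num_experts)

    def forward(self, x, task_id):
        # x: (batch, tokens, dim); task_id: integer index of the active task.
        logits = self.task_gate(torch.tensor(task_id, device=x.device))
        weights, idx = logits.softmax(-1).topk(self.top_k)
        weights = weights / weights.sum()  # renormalize over selected experts
        # Only the top-k experts are evaluated; the rest stay inactive,
        # which is the source of the inference FLOPs savings.
        return sum(w * self.experts[i](x) for w, i in zip(weights, idx.tolist()))


if __name__ == "__main__":
    layer = TaskMoEFFN()
    tokens = torch.randn(2, 197, 384)  # e.g., ViT patch tokens + class token
    out = layer(tokens, task_id=1)     # activates only task 1's expert pathway
    print(out.shape)                   # torch.Size([2, 197, 384])
```

In this sketch only `top_k` of the `num_experts` expert MLPs run per forward pass, so per-task inference cost stays roughly constant as experts are added; the paper's hardware-level computation reordering, which this software sketch does not model, is what additionally makes switching the active expert subset on the FPGA zero-overhead.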
