Paper Title

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Authors

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

Abstract

Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models. While there is emerging work on relieving this pressure via model compression, the applicability and performance of existing compression techniques are limited by the scale and complexity of GPT models. In this paper, we address this challenge, and propose GPTQ, a new one-shot weight quantization method based on approximate second-order information, that is both highly-accurate and highly-efficient. Specifically, GPTQ can quantize GPT models with 175 billion parameters in approximately four GPU hours, reducing the bitwidth down to 3 or 4 bits per weight, with negligible accuracy degradation relative to the uncompressed baseline. Our method more than doubles the compression gains relative to previously-proposed one-shot quantization methods, preserving accuracy, allowing us for the first time to execute a 175 billion-parameter model inside a single GPU for generative inference. Moreover, we also show that our method can still provide reasonable accuracy in the extreme quantization regime, in which weights are quantized to 2-bit or even ternary quantization levels. We show experimentally that these improvements can be leveraged for end-to-end inference speedups over FP16, of around 3.25x when using high-end GPUs (NVIDIA A100) and 4.5x when using more cost-effective ones (NVIDIA A6000). The implementation is available at https://github.com/IST-DASLab/gptq.
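The core mechanism the abstract summarizes — quantizing a layer's weights one column at a time while using approximate second-order (inverse-Hessian) information to compensate the not-yet-quantized columns — can be illustrated compactly. The sketch below is a minimal NumPy rendering of that idea under simplified, assumed choices: a plain symmetric round-to-nearest grid, one toy layer, and random calibration data. Function names here are illustrative, and this is not the authors' optimized implementation (see the linked repository), which additionally processes columns in blocks with lazy batched updates and more refined grid/grouping options.

```python
import numpy as np

def quantize_rtn(w, scale, maxq):
    """Round-to-nearest onto a symmetric integer grid, then dequantize."""
    q = np.clip(np.round(w / scale), -maxq, maxq)
    return q * scale

def gptq_like_quantize(W, X, bits=4, damp=0.01):
    """Quantize W (rows = output channels, cols = input features) one column
    at a time, spreading each column's quantization error over the remaining
    columns via the Cholesky factor of the inverse Hessian (GPTQ-style)."""
    W = W.astype(np.float64).copy()
    rows, cols = W.shape
    maxq = 2 ** (bits - 1) - 1
    # Simple per-row symmetric scale; illustrative, not the paper's grid setup.
    scale = np.abs(W).max(axis=1) / maxq

    # Layer-wise Hessian of the squared reconstruction error: H = 2 X X^T,
    # dampened for numerical stability.
    H = 2.0 * X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(cols)
    # Upper Cholesky factor U of H^{-1} (so H^{-1} = U^T U); its rows supply
    # all error-compensation coefficients from a single decomposition.
    U = np.linalg.cholesky(np.linalg.inv(H)).T

    Q = np.zeros_like(W)
    for j in range(cols):
        w = W[:, j]
        q = quantize_rtn(w, scale, maxq)
        Q[:, j] = q
        # Distribute this column's quantization error over columns j..end.
        err = (w - q) / U[j, j]
        W[:, j:] -= np.outer(err, U[j, j:])
    return Q

# Toy usage: one random "layer" with random calibration inputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
X = rng.standard_normal((16, 128))
Q = gptq_like_quantize(W, X, bits=4)
print("reconstruction MSE:", np.mean((W @ X - Q @ X) ** 2))
```

Working from the Cholesky factor of the inverse Hessian yields all the elimination coefficients in one shot, which is both cheaper and numerically more stable than re-updating the inverse Hessian after every quantized column.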
