Paper Title
AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model
Paper Authors
Paper Abstract
In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming the much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs supported by the model (Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu) on the Flores-101 dataset. We also show that, in the zero-shot setting, AlexaTM 20B outperforms GPT-3 (175B) on the SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual tasks such as XNLI, XCOPA, Paws-X, and XWinograd. Overall, our results present a compelling case for seq2seq models as a powerful alternative to decoder-only models for Large-scale Language Model (LLM) training.
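To make the pretraining mixture concrete, the sketch below shows one way to construct (encoder input, decoder target) pairs for the two objectives described in the abstract: denoising (reconstruct the original text from a corrupted copy) and CLM (continue a given prefix). The mixing ratio, the token-dropping scheme, and the `[CLM]` marker token are illustrative assumptions, not details taken from the AlexaTM 20B paper.

```python
import random

# Minimal sketch of a denoising + CLM pretraining mixture for a seq2seq model.
# All constants below (CLM_FRACTION, drop_prob, prefix_frac, "[CLM]") are
# assumptions for illustration only.

CLM_FRACTION = 0.2  # assumed fraction of CLM examples in the mixture


def make_denoising_example(tokens, drop_prob=0.15):
    """Corrupt the encoder input by randomly dropping tokens;
    the decoder target is the original, uncorrupted text."""
    corrupted = [t for t in tokens if random.random() > drop_prob]
    return " ".join(corrupted), " ".join(tokens)


def make_clm_example(tokens, prefix_frac=0.5):
    """Give the encoder a prefix (tagged with an assumed [CLM] marker);
    the decoder target is the remainder of the text."""
    split = max(1, int(len(tokens) * prefix_frac))
    encoder_input = "[CLM] " + " ".join(tokens[:split])
    decoder_target = " ".join(tokens[split:])
    return encoder_input, decoder_target


def make_pretraining_example(text):
    """Sample one (encoder_input, decoder_target) pair from the mixture."""
    tokens = text.split()
    if random.random() < CLM_FRACTION:
        return make_clm_example(tokens)
    return make_denoising_example(tokens)


if __name__ == "__main__":
    sample = "seq2seq models pre-trained on denoising and CLM are strong few-shot learners"
    for _ in range(3):
        enc, dec = make_pretraining_example(sample)
        print("encoder:", enc)
        print("decoder:", dec)
```

Training on both objectives is what lets a single seq2seq model serve two roles: the denoising task supports conditional tasks such as summarization and translation, while the CLM task supports the open-ended, prompt-and-continue style of in-context few-shot learning highlighted in the results above.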