Paper Title
Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging
Paper Authors
Paper Abstract
Training vision or language models on large datasets can take days, if not weeks. We show that averaging the weights of the k latest checkpoints, each collected at the end of an epoch, can speed up the training progression in terms of loss and accuracy by dozens of epochs, corresponding to time savings of up to ~68 and ~30 GPU hours when training a ResNet50 on ImageNet and a RoBERTa-Base model on WikiText-103, respectively. We also provide the code and model checkpoint trajectory to reproduce the results and facilitate research on reusing historical weights for faster convergence.
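The recipe in the abstract is straightforward to sketch: keep the k most recent end-of-epoch checkpoints and evaluate the element-wise average of their weights. Below is a minimal PyTorch sketch of this idea; the toy model, the choice k = 5, and the loop structure are illustrative assumptions, not the paper's reference implementation (the authors' released code is the authoritative version).

```python
from collections import deque
import copy
import torch

def average_state_dicts(state_dicts):
    """Element-wise average of a list of model state dicts (assumed helper)."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        if torch.is_floating_point(avg[key]):
            avg[key] = torch.stack(
                [sd[key].float() for sd in state_dicts]
            ).mean(dim=0)
    return avg

# Toy stand-in for ResNet50/RoBERTa; the loop shape is what matters here.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

k = 5                          # number of latest checkpoints to average (assumed)
checkpoints = deque(maxlen=k)  # automatically drops the oldest checkpoint

for epoch in range(20):
    # One "epoch" of dummy training steps on random data.
    for _ in range(10):
        x = torch.randn(32, 10)
        loss = model(x).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Collect the checkpoint at the end of the epoch, as in the abstract.
    checkpoints.append(copy.deepcopy(model.state_dict()))

    # Evaluate the averaged weights in a copy, leaving the training run untouched.
    eval_model = copy.deepcopy(model)
    eval_model.load_state_dict(average_state_dicts(list(checkpoints)))
```

The key design point is that the average is used only for evaluation: the optimizer keeps updating the live weights, so averaging adds essentially no training cost beyond storing k state dicts.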