Paper Title
Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
Paper Authors
Paper Abstract
Despite their wide adoption, the underlying training and memorization dynamics of very large language models are not well understood. We empirically study exact memorization in causal and masked language modeling, across model sizes and throughout the training process. We measure the effects of dataset size, learning rate, and model size on memorization, finding that larger language models memorize training data faster across all settings. Surprisingly, we show that larger models can memorize a larger portion of the data before overfitting and tend to forget less throughout the training process. We also analyze the memorization dynamics of different parts of speech and find that models memorize nouns and numbers first; we hypothesize and provide empirical evidence that nouns and numbers act as unique identifiers for memorizing individual training examples. Together, these findings present another piece of the broader puzzle of trying to understand what actually improves as models get bigger.
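To make the abstract's notion of "exact memorization" concrete, below is a minimal sketch of one common way to operationalize it for a causal language model: the fraction of training tokens whose greedy (argmax) prediction matches the ground-truth next token given the preceding context. This is an illustrative assumption, not the paper's exact protocol; the model name "gpt2" and the helper `exact_memorization` are stand-ins chosen for the example.

```python
# Sketch: per-token exact-memorization rate for a causal LM (illustrative only;
# the paper's own measurement protocol may differ in details).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def exact_memorization(model, tokenizer, texts, device="cpu"):
    """Fraction of tokens whose argmax prediction equals the true next token,
    conditioned on the preceding tokens of the training example."""
    model.eval().to(device)
    correct, total = 0, 0
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
            logits = model(ids).logits             # (1, seq_len, vocab_size)
            preds = logits[:, :-1].argmax(dim=-1)  # prediction for token t+1 from prefix
            targets = ids[:, 1:]                   # ground-truth next tokens
            correct += (preds == targets).sum().item()
            total += targets.numel()
    return correct / max(total, 1)

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in model
    lm = AutoModelForCausalLM.from_pretrained("gpt2")
    print(exact_memorization(lm, tok, ["The quick brown fox jumps over the lazy dog."]))
```

Tracking this rate over training checkpoints, model sizes, and dataset sizes is the kind of measurement the study's findings about faster memorization in larger models refer to.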