Paper Title

Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning

Paper Authors

Armen Aghajanyan, Luke Zettlemoyer, Sonal Gupta

Paper Abstract

Although pretrained language models can be fine-tuned to produce state-of-the-art results for a very wide range of language understanding tasks, the dynamics of this process are not well understood, especially in the low data regime. Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples? In this paper, we argue that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions to explain this remarkable phenomenon. We empirically show that common pre-trained models have a very low intrinsic dimension; in other words, there exists a low dimension reparameterization that is as effective for fine-tuning as the full parameter space. For example, by optimizing only 200 trainable parameters randomly projected back into the full space, we can tune a RoBERTa model to achieve 90% of the full parameter performance levels on MRPC. Furthermore, we empirically show that pre-training implicitly minimizes intrinsic dimension and, perhaps surprisingly, larger models tend to have lower intrinsic dimension after a fixed number of pre-training updates, at least in part explaining their extreme effectiveness. Lastly, we connect intrinsic dimensionality with low dimensional task representations and compression based generalization bounds to provide intrinsic-dimension-based generalization bounds that are independent of the full parameter count.
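As a concrete illustration of the reparameterization the abstract describes, below is a minimal PyTorch sketch of intrinsic-dimension fine-tuning: all original weights stay frozen at their pre-trained values theta_0, and only a single d-dimensional vector z is optimized, with each weight tensor reconstructed as theta_0 + P z through a fixed random projection P. The toy model, random data, dense Gaussian projection, and scaling are illustrative assumptions made here; the paper's experiments use RoBERTa and memory-efficient structured projections, and this sketch assumes torch.func.functional_call (PyTorch 2.0+) is available.

```python
import torch
from torch import nn
from torch.func import functional_call

# Minimal sketch of intrinsic-dimension fine-tuning on a toy classifier.
# Every weight tensor is reconstructed as theta_0 + P @ z, where theta_0 is
# the frozen pre-trained value, P is a fixed random projection, and z is the
# only trainable vector (d parameters in total).

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
d = 200  # candidate intrinsic dimension (200, as in the MRPC example)

theta_0 = {n: p.detach().clone() for n, p in model.named_parameters()}
# One fixed random projection per parameter tensor (dense Gaussian here;
# the scaling is a simplification chosen for numerical stability).
projections = {
    n: torch.randn(p.numel(), d) / (p.numel() ** 0.5)
    for n, p in model.named_parameters()
}
z = torch.zeros(d, requires_grad=True)  # the only trainable parameters
optimizer = torch.optim.Adam([z], lr=1e-2)

def current_params():
    # theta = theta_0 + P z, reshaped back to each tensor's original shape.
    return {n: theta_0[n] + (projections[n] @ z).view_as(theta_0[n])
            for n in theta_0}

# Toy training loop on random data; a real run would use a task like MRPC.
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
for step in range(100):
    logits = functional_call(model, current_params(), (x,))
    loss = nn.functional.cross_entropy(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the paper's measurement protocol, the intrinsic dimension of a task is the smallest d at which this kind of low-dimensional fine-tuning reaches a set fraction (e.g., 90%) of full fine-tuning performance, so in practice one would sweep over d rather than fix it in advance.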
