Paper Title
Data Distributional Properties Drive Emergent In-Context Learning in Transformers
Paper Authors
Paper Abstract
Large transformer-based models are able to perform in-context few-shot learning, without being explicitly trained for it. This observation raises the question: what aspects of the training regime lead to this emergent behavior? Here, we show that this behavior is driven by the distributions of the training data itself. In-context learning emerges when the training data exhibits particular distributional properties such as burstiness (items appear in clusters rather than being uniformly distributed over time) and having large numbers of rarely occurring classes. In-context learning also emerges more strongly when item meanings or interpretations are dynamic rather than fixed. These properties are exemplified by natural language, but are also inherent to naturalistic data in a wide range of other domains. They also depart significantly from the uniform, i.i.d. training distributions typically used for standard supervised learning. In our initial experiments, we found that in-context learning traded off against more conventional weight-based learning, and models were unable to achieve both simultaneously. However, our later experiments uncovered that the two modes of learning could co-exist in a single model when it was trained on data following a skewed Zipfian distribution -- another common property of naturalistic data, including language. In further experiments, we found that naturalistic data distributions were only able to elicit in-context learning in transformers, and not in recurrent models. In sum, our findings indicate how the transformer architecture works together with particular properties of the training data to drive the intriguing emergent in-context learning behaviour of large language models, and how future work might encourage both in-context and in-weights learning in domains beyond language.
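Below is a minimal, purely illustrative sketch of the two distributional properties the abstract highlights: a Zipfian (rank-frequency) marginal over classes, and "bursty" episodes in which a few classes recur within a context window. The parameter values, function name, and episode layout are assumptions for illustration only, not the paper's actual data-generation procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions, not taken from the paper).
num_classes = 1000       # many classes, most of them rare
zipf_exponent = 1.0      # skew of the rank-frequency distribution
context_len = 8          # items shown in-context before the query

# Zipfian marginal over classes: a handful of classes are frequent,
# the long tail is rarely seen.
ranks = np.arange(1, num_classes + 1)
class_probs = ranks ** (-zipf_exponent)
class_probs /= class_probs.sum()

def sample_bursty_episode(p_bursty=0.9, burst_classes=2):
    """Sample one hypothetical training episode.

    With probability p_bursty the context is 'bursty': items cluster
    around a small set of classes and the query class is among them,
    so the answer can be read off from the context. Otherwise items
    are drawn i.i.d. from the Zipfian marginal, as in standard
    supervised training.
    """
    if rng.random() < p_bursty:
        classes = rng.choice(num_classes, size=burst_classes,
                             replace=False, p=class_probs)
        context = rng.choice(classes, size=context_len)
        query = rng.choice(classes)
    else:
        context = rng.choice(num_classes, size=context_len, p=class_probs)
        query = rng.choice(num_classes, p=class_probs)
    return context, query

context, query = sample_bursty_episode()
print("context classes:", context, "query class:", query)
```

Episodes like these contrast with a uniform i.i.d. regime, where every class is equally likely and never clusters within a context window; the abstract's claim is that the bursty, skewed regime is what elicits in-context learning in transformers.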