Paper Title

Fine-Tuning Pre-trained Transformers into Decaying Fast Weights

Paper Authors

Mao, Huanru Henry

Paper Abstract

Autoregressive Transformers are strong language models but incur O(T) complexity during per-token generation due to the self-attention mechanism. Recent work proposes kernel-based methods to approximate causal self-attention by replacing it with recurrent formulations with various update rules and feature maps to achieve O(1) time and memory complexity. We explore these approaches and find that they are unnecessarily complex, and propose a simple alternative - decaying fast weights - that runs fast on GPU, outperforms prior methods, and retains 99% of attention's performance for GPT-2. We also show competitive performance on WikiText-103 against more complex attention substitutes.
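The abstract describes replacing causal self-attention with a recurrent, decaying fast-weight state that is updated in O(1) time and memory per generated token. As an illustration only (the paper's exact update rule, decay parameterization, and feature map are not specified here), below is a minimal sketch of one decaying fast-weight recurrence step in PyTorch; the function name `decaying_fast_weight_step`, the scalar `decay`, and the running normalizer `z` are assumptions made for this sketch, not the authors' implementation.

```python
import torch

def decaying_fast_weight_step(state, z, q, k, v, decay=0.99):
    """One recurrent step of a (hypothetical) decaying fast-weight layer.

    state: (d_k, d_v) fast-weight matrix accumulated from past tokens
    z:     (d_k,) running key normalizer
    q, k:  (d_k,) non-negative query/key features for the current token
    v:     (d_v,) value for the current token
    decay: scalar in (0, 1); older associations are exponentially forgotten
    """
    # Decay old associations, then write the new key-value outer product.
    state = decay * state + torch.outer(k, v)
    z = decay * z + k
    # Read out with the query; normalize so the output scale stays stable.
    out = (q @ state) / (q @ z + 1e-6)
    return out, state, z


# Toy usage: produce per-token outputs with constant-size state,
# instead of attending over all T previous tokens.
d_k, d_v, T = 64, 64, 10
state = torch.zeros(d_k, d_v)
z = torch.zeros(d_k)
for _ in range(T):
    # elu(x) + 1 is one common non-negative feature map in linear attention.
    q = torch.nn.functional.elu(torch.randn(d_k)) + 1
    k = torch.nn.functional.elu(torch.randn(d_k)) + 1
    v = torch.randn(d_v)
    out, state, z = decaying_fast_weight_step(state, z, q, k, v)
```

Because the state has fixed size regardless of sequence length, each generation step costs the same amount of compute and memory, which is the O(1) per-token property the abstract refers to.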
