Paper Title

Pretraining Without Attention

Paper Authors

Junxiong Wang, Jing Nathan Yan, Albert Gu, Alexander M. Rush

Paper Abstract

Transformers have been essential to pretraining success in NLP. While other architectures have been used, downstream accuracy is either significantly worse, or requires attention layers to match standard benchmarks such as GLUE. This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs). Our proposed model, Bidirectional Gated SSM (BiGS), combines SSM layers with a multiplicative gating architecture that has been effective in simplified sequence modeling architectures. The model learns static layers that do not consider pair-wise interactions. Even so, BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation. Analysis shows that while the models have similar average accuracy, the approach has different inductive biases than BERT in terms of interactions and syntactic representations. All models from this work are available at https://github.com/jxiw/BiGS.
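
A hedged illustration: the abstract names the ingredients of a BiGS block (bidirectional SSM layers combined with multiplicative gating, with no attention or pair-wise interactions), so the short PyTorch sketch below shows one way such a block could be wired together. The ToySSM class is a naive diagonal state-space scan standing in for the paper's trained SSM kernel, and the gate/projection layout (gate, proj_f, proj_b, out) is an assumption for illustration, not the authors' exact architecture; see the linked repository for the real model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySSM(nn.Module):
    """Naive diagonal state-space scan: x_t = a * x_{t-1} + b * u_t, y_t = sum(c * x_t).

    Stand-in for the trained SSM kernel used in BiGS; purely illustrative.
    """
    def __init__(self, d_model, d_state=16):
        super().__init__()
        self.a_raw = nn.Parameter(torch.randn(d_model, d_state))
        self.b = nn.Parameter(torch.randn(d_model, d_state) * 0.1)
        self.c = nn.Parameter(torch.randn(d_model, d_state) * 0.1)

    def forward(self, u):                          # u: (batch, length, d_model)
        a = torch.sigmoid(self.a_raw)              # per-channel decay in (0, 1)
        state = u.new_zeros(u.size(0), u.size(2), a.size(1))
        outputs = []
        for t in range(u.size(1)):                 # static left-to-right scan, no attention
            state = a * state + self.b * u[:, t, :, None]
            outputs.append((state * self.c).sum(-1))
        return torch.stack(outputs, dim=1)         # (batch, length, d_model)

class BiGSBlock(nn.Module):
    """Bidirectional gated SSM block: forward and backward SSMs fused by a multiplicative gate."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.fwd_ssm = ToySSM(d_model)
        self.bwd_ssm = ToySSM(d_model)
        self.gate = nn.Linear(d_model, d_model)    # multiplicative "v" branch (assumed layout)
        self.proj_f = nn.Linear(d_model, d_model)
        self.proj_b = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, length, d_model)
        h = self.norm(x)
        v = torch.sigmoid(self.gate(h))            # gate values in (0, 1)
        fwd = self.fwd_ssm(F.gelu(self.proj_f(h)))
        bwd = self.bwd_ssm(F.gelu(self.proj_b(h)).flip(1)).flip(1)
        return x + self.out(v * (fwd + bwd))       # gated combination plus residual

# Example: encode 2 sequences of length 8 with hidden size 64.
block = BiGSBlock(d_model=64)
print(block(torch.randn(2, 8, 64)).shape)          # torch.Size([2, 8, 64])
```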
