Paper Title
The Parallelism Tradeoff: Limitations of Log-Precision Transformers
Paper Authors
Paper Abstract
Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight on the power of transformers using known results in complexity theory. For example, if $\mathsf L \neq \mathsf P$ (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture's high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.
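As a hedged restatement of the abstract's claim in standard complexity-theoretic notation (a sketch, not the paper's formal theorem statement; the class name $\mathsf{LogPrecTransformer}$ is an informal placeholder for the set of languages recognized by the transformers described above, and the final containment uses the standard fact that logspace-uniform $\mathsf{TC}^0$ circuits can be evaluated in logarithmic space):

% Sketch only: informal notation, not taken verbatim from the paper.
$$\mathsf{LogPrecTransformer} \;\subseteq\; \mathsf{L}\text{-uniform } \mathsf{TC}^0 \;\subseteq\; \mathsf{L},$$
$$\mathsf{L} \neq \mathsf{P} \;\Longrightarrow\; \text{no } \mathsf{P}\text{-complete problem (e.g., the two examples above) is solvable by such transformers.}$$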