Paper Title


N-ODE Transformer: A Depth-Adaptive Variant of the Transformer Using Neural Ordinary Differential Equations

Authors

Baier-Reinio, Aaron, De Sterck, Hans

Abstract


We use neural ordinary differential equations to formulate a variant of the Transformer that is depth-adaptive in the sense that an input-dependent number of time steps is taken by the ordinary differential equation solver. Our goal in proposing the N-ODE Transformer is to investigate whether its depth-adaptivity may aid in overcoming some specific known theoretical limitations of the Transformer in handling nonlocal effects. Specifically, we consider the simple problem of determining the parity of a binary sequence, for which the standard Transformer has known limitations that can only be overcome by using a sufficiently large number of layers or attention heads. We find, however, that the depth-adaptivity of the N-ODE Transformer does not provide a remedy for the inherently nonlocal nature of the parity problem, and provide explanations for why this is so. Next, we pursue regularization of the N-ODE Transformer by penalizing the arclength of the ODE trajectories, but find that this fails to improve the accuracy or efficiency of the N-ODE Transformer on the challenging parity problem. We suggest future avenues of research for modifications and extensions of the N-ODE Transformer that may lead to improved accuracy and efficiency for sequence modelling tasks such as neural machine translation.
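The two mechanisms the abstract describes can be illustrated in miniature: an adaptive ODE solver takes an input-dependent number of time steps (more steps where the trajectory varies rapidly, which is the "depth-adaptivity"), and the arclength of the resulting trajectory is the quantity one could penalize as a regularizer. Below is a minimal pure-Python sketch of both ideas, not the paper's actual model: a scalar ODE and a Heun (RK2) integrator with a textbook step-size controller stand in for the Transformer vector field and a production adaptive solver.

```python
import math

def heun_adaptive(f, y0, t0, t1, tol=1e-4):
    """Integrate y' = f(t, y) with Heun's method and a simple adaptive
    step-size rule: the local error estimate (Heun vs. forward Euler)
    controls the step, so inputs whose trajectories vary faster consume
    more solver steps -- the depth-adaptive behaviour described above.
    Also accumulates the arclength of the scalar trajectory, a stand-in
    for the quantity penalized in the arclength regularization."""
    y, t, h = y0, t0, 0.1
    steps, arclength = 0, 0.0
    while t < t1:
        h = min(h, t1 - t)
        k1 = f(t, y)
        k2 = f(t + h, y + h * k1)
        y_heun = y + 0.5 * h * (k1 + k2)
        err = 0.5 * h * abs(k2 - k1)   # |Heun - Euler| local error estimate
        if err <= tol:                  # accept the step
            arclength += abs(y_heun - y)
            t, y = t + h, y_heun
            steps += 1
        # grow/shrink the step toward the error tolerance (clamped)
        h *= 0.9 * min(2.0, max(0.2, math.sqrt(tol / (err + 1e-12))))
    return y, steps, arclength

# A slowly varying input vs. a rapidly varying one: the adaptive solver
# spends many more steps (i.e. "layers") on the fast dynamics.
y_slow, steps_slow, arc_slow = heun_adaptive(lambda t, y: -y, 1.0, 0.0, 1.0)
y_fast, steps_fast, arc_fast = heun_adaptive(lambda t, y: -25.0 * y, 1.0, 0.0, 1.0)
print(steps_slow, steps_fast)  # step count depends on the input dynamics
```

The same input-dependence is what fails to help on parity: the solver adapts to how fast the hidden state evolves, not to how much nonlocal information must be aggregated across sequence positions.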
