Paper Title

Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics

Paper Authors

Daniel Kunin, Javier Sagastuy-Brena, Surya Ganguli, Daniel L. K. Yamins, Hidenori Tanaka

Paper Abstract


Understanding the dynamics of neural network parameters during training is one of the key challenges in building a theoretical foundation for deep learning. A central obstacle is that the motion of a network in high-dimensional parameter space undergoes discrete finite steps along complex stochastic gradients derived from real-world datasets. We circumvent this obstacle through a unifying theoretical framework based on intrinsic symmetries embedded in a network's architecture that are present for any dataset. We show that any such symmetry imposes stringent geometric constraints on gradients and Hessians, leading to an associated conservation law in the continuous-time limit of stochastic gradient descent (SGD), akin to Noether's theorem in physics. We further show that finite learning rates used in practice can actually break these symmetry-induced conservation laws. We apply tools from finite difference methods to derive modified gradient flow, a differential equation that better approximates the numerical trajectory taken by SGD at finite learning rates. We combine modified gradient flow with our framework of symmetries to derive exact integral expressions for the dynamics of certain parameter combinations. We empirically validate our analytic expressions for learning dynamics on VGG-16 trained on Tiny ImageNet. Overall, by exploiting symmetry, our work demonstrates that we can analytically describe the learning dynamics of various parameter combinations at finite learning rates and batch sizes for state-of-the-art architectures trained on any dataset.
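As a minimal illustration of the kind of symmetry-induced conservation law the abstract refers to, the sketch below works through the scale-symmetry case (e.g., parameters feeding into a batch normalization layer). This is an assumed illustrative example written for this page, not the paper's full derivation; the symbols (loss L, parameters theta, learning rate eta) are generic.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% Illustrative sketch (assumed example): a scale symmetry of the loss yields a
% conserved quantity under gradient flow, which a finite learning rate breaks.
Suppose the loss is scale invariant in a parameter block $\theta$, i.e.
$\mathcal{L}(\alpha\theta) = \mathcal{L}(\theta)$ for all $\alpha > 0$.
Differentiating with respect to $\alpha$ at $\alpha = 1$ gives the geometric
constraint
\[
  \langle \theta, \nabla_\theta \mathcal{L}(\theta) \rangle = 0 .
\]
Under gradient flow $\dot{\theta} = -\nabla_\theta \mathcal{L}(\theta)$ the
squared norm of the block is therefore conserved:
\[
  \frac{d}{dt}\,\|\theta\|_2^2
  = -2\,\langle \theta, \nabla_\theta \mathcal{L}(\theta) \rangle = 0 .
\]
A discrete SGD step with finite learning rate $\eta$ breaks this conservation
law, since the cross term vanishes but the second-order term does not:
\[
  \|\theta_{t+1}\|_2^2
  = \|\theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t)\|_2^2
  = \|\theta_t\|_2^2 + \eta^2 \|\nabla_\theta \mathcal{L}(\theta_t)\|_2^2 .
\]
\end{document}
```

The same pattern (a symmetry forces an orthogonality constraint on the gradient, gradient flow then conserves a quantity, and the discrete finite-learning-rate update perturbs it) is what the abstract refers to when it speaks of broken conservation laws at finite learning rates.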
