Paper Title
On the Generalization Mystery in Deep Learning
Paper Authors
Paper Abstract
The generalization mystery in deep learning is the following: Why do over-parameterized neural networks trained with gradient descent (GD) generalize well on real datasets even though they are capable of fitting random datasets of comparable size? Furthermore, from among all solutions that fit the training data, how does GD find one that generalizes well (when such a well-generalizing solution exists)? We argue that the answer to both questions lies in the interaction of the gradients of different examples during training. Intuitively, if the per-example gradients are well-aligned, that is, if they are coherent, then one may expect GD to be (algorithmically) stable, and hence generalize well. We formalize this argument with an easy-to-compute and interpretable metric for coherence, and show that the metric takes on very different values on real and random datasets for several common vision networks. The theory also explains a number of other phenomena in deep learning, such as why some examples are reliably learned earlier than others, why early stopping works, and why it is possible to learn from noisy labels. Moreover, since the theory provides a causal explanation of how GD finds a well-generalizing solution when one exists, it motivates a class of simple modifications to GD that attenuate memorization and improve generalization. Generalization in deep learning is an extremely broad phenomenon, and therefore it requires an equally general explanation. We conclude with a survey of alternative lines of attack on this problem, and argue that, on this basis, the proposed approach is the most viable one.
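To make the notion of gradient coherence concrete, here is a minimal sketch of one simple instantiation: the squared norm of the mean per-example gradient divided by the mean squared per-example norm. This ratio is near 1/m for m mutually near-orthogonal gradients (as one might see with random labels) and 1 when all gradients point the same way. The function name and this particular normalization are illustrative assumptions, not necessarily the paper's exact metric.

```python
import numpy as np

def coherence(per_example_grads: np.ndarray) -> float:
    """Illustrative gradient-coherence measure (not necessarily the
    paper's exact metric): ||mean gradient||^2 / mean ||gradient||^2.
    Rows of `per_example_grads` are flattened per-example gradients.
    Ranges from ~1/m (m near-orthogonal gradients) up to 1 (all aligned).
    """
    g_mean = per_example_grads.mean(axis=0)
    return float(np.dot(g_mean, g_mean) /
                 np.mean(np.sum(per_example_grads ** 2, axis=1)))

rng = np.random.default_rng(0)
aligned = np.tile(rng.normal(size=8), (16, 1))   # 16 identical gradients
random_g = rng.normal(size=(16, 10_000))         # 16 near-orthogonal gradients

print(coherence(aligned))   # 1.0: perfectly coherent
print(coherence(random_g))  # close to 1/16: incoherent
```

On a real training run one would compute such a quantity from per-example gradients at each step; high values suggest examples "help" each other, which is the intuition behind the stability argument in the abstract.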