Paper Title
Probability-Dependent Gradient Decay in Large Margin Softmax
Paper Authors
Paper Abstract
In the past few years, Softmax has become a common component in neural network frameworks. In this paper, a gradient decay hyperparameter is introduced into Softmax to control the probability-dependent gradient decay rate during training. Following theoretical analysis and empirical results for a variety of model architectures trained on MNIST, CIFAR-10/100 and SVHN, we find that generalization performance depends significantly on the gradient decay rate as the confidence probability rises, i.e., on whether the gradient decreases convexly or concavely as the sample probability increases. Moreover, optimization with a small gradient decay exhibits a curriculum-learning-like ordering in which hard samples come into the spotlight only after easy samples have been fitted with sufficient confidence, and well-separated samples receive a larger gradient that reduces intra-class distance. Based on these analysis results, we provide evidence that large margin Softmax affects the local Lipschitz constraint of the loss function by regulating the probability-dependent gradient decay rate. By analyzing the gradient decay rate, this paper offers a new perspective on the relationship among large margin Softmax, the local Lipschitz constraint and curriculum learning. In addition, we propose a warm-up strategy that dynamically adjusts the Softmax loss during training, in which the gradient decay rate increases from an initially small value in order to speed up convergence.
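To make the idea of probability-dependent gradient decay concrete, the sketch below is a minimal, hypothetical illustration rather than the paper's exact formulation: an additive-margin Softmax cross-entropy in which the margin `m` serves as a stand-in for the gradient decay hyperparameter, together with a toy warm-up schedule that grows it over training. The names `margin_softmax_ce` and `warmup_margin` are placeholders introduced here for illustration only.

```python
# Illustrative sketch (assumed formulation, not the paper's): with margin m = 0 this is
# standard Softmax cross-entropy, whose gradient w.r.t. the target logit is (p_y - 1)
# and thus decays as the confidence p_y rises; subtracting a margin from the target
# logit keeps p_y lower for the same logits, so the gradient decays more slowly.
import torch
import torch.nn.functional as F

def margin_softmax_ce(logits: torch.Tensor, targets: torch.Tensor, m: float = 0.0) -> torch.Tensor:
    """Cross-entropy with an additive margin m subtracted from the target-class logit."""
    margin = torch.zeros_like(logits).scatter_(1, targets.unsqueeze(1), m)
    return F.cross_entropy(logits - margin, targets)

def warmup_margin(step: int, total_steps: int, m_max: float = 0.35) -> float:
    """Toy warm-up: grow the margin linearly from 0 to m_max over training."""
    return m_max * min(step / max(total_steps, 1), 1.0)

# Usage: inspect how the gradient on the target logit behaves as confidence grows.
logits = torch.tensor([[2.0, 0.5, -1.0]], requires_grad=True)
targets = torch.tensor([0])
loss = margin_softmax_ce(logits, targets, m=warmup_margin(step=500, total_steps=1000))
loss.backward()
print(logits.grad)  # gradient on the target logit equals (p_y - 1) under the shifted logits
```

In this hypothetical setup, scheduling `m` upward over training mirrors the warm-up idea described in the abstract: early on the loss behaves like plain Softmax cross-entropy, and the probability-dependent gradient decay is relaxed only as training progresses.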