Paper Title

The Impact of the Mini-batch Size on the Variance of Gradients in Stochastic Gradient Descent

Authors

Xin Qian, Diego Klabjan

Abstract

The mini-batch stochastic gradient descent (SGD) algorithm is widely used in training machine learning models, in particular deep learning models. We study SGD dynamics under linear regression and two-layer linear networks, with an easy extension to deeper linear networks, by focusing on the variance of the gradients, which is the first study of this nature. In the linear regression case, we show that in each iteration the norm of the gradient is a decreasing function of the mini-batch size $b$, and thus the variance of the stochastic gradient estimator is a decreasing function of $b$. For deep neural networks with $L_2$ loss we show that the variance of the gradient is a polynomial in $1/b$. The results support the important intuition that smaller batch sizes yield lower loss function values, which is a common belief among researchers. The proof techniques exhibit a relationship between stochastic gradient estimators and initial weights, which is useful for further research on the dynamics of SGD. We empirically provide further insights into our results on various datasets and commonly used deep network structures.
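As a minimal illustrative sketch of the relationship the abstract describes (not code from the paper), the snippet below estimates the variance of the mini-batch gradient estimator for a synthetic linear-regression problem with $L_2$ loss at a fixed weight vector, for several mini-batch sizes $b$. The data dimensions, sampling scheme, and variable names are assumptions made for this example only.

```python
import numpy as np

# Illustrative sketch (assumed setup, not the paper's code): estimate the
# variance of the mini-batch gradient for linear regression with L2 loss
# at a fixed weight vector, for several mini-batch sizes b.

rng = np.random.default_rng(0)

n, d = 2000, 20                       # number of samples, feature dimension
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = rng.normal(size=d)                # a fixed (e.g. initial) weight vector

def minibatch_gradient(batch_idx):
    """Gradient of the mean squared error over one mini-batch, evaluated at w."""
    Xb, yb = X[batch_idx], y[batch_idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(batch_idx)

full_grad = minibatch_gradient(np.arange(n))   # full-batch gradient

for b in [1, 4, 16, 64, 256]:
    grads = np.array([
        minibatch_gradient(rng.choice(n, size=b, replace=False))
        for _ in range(2000)
    ])
    # Mean squared deviation of the stochastic estimator from the full gradient.
    var = np.mean(np.sum((grads - full_grad) ** 2, axis=1))
    print(f"b = {b:4d}   E||g_b - g||^2 = {var:.4f}")
```

Under this setup the printed deviation shrinks roughly in proportion to $1/b$, consistent with the abstract's claim that the variance of the stochastic gradient estimator decreases with the mini-batch size.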
