Paper Title

A block coordinate descent optimizer for classification problems exploiting convexity

Authors

Patel, Ravi G., Trask, Nathaniel A., Gulian, Mamikon A., Cyr, Eric C.

Abstract

Second-order optimizers hold intriguing potential for deep learning, but suffer from increased cost and sensitivity to the non-convexity of the loss surface as compared to gradient-based approaches. We introduce a coordinate descent method to train deep neural networks for classification tasks that exploits the global convexity of the cross-entropy loss in the weights of the linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent with the interpretation of hidden layers as providing an adaptive basis and the linear layer as providing an optimal fit of the basis to data. By alternating between a second-order method to find globally optimal parameters for the linear layer and gradient descent to train the hidden layers, we ensure an optimal fit of the adaptive basis to the data throughout training. The size of the Hessian in the second-order step scales only with the number of weights in the linear layer and not with the depth and width of the hidden layers; furthermore, the approach is applicable to arbitrary hidden layer architectures. Previous work applying this adaptive basis perspective to regression problems demonstrated significant improvements in accuracy at reduced training cost, and this work can be viewed as an extension of that approach to classification problems. We first prove that the resulting Hessian matrix is symmetric semi-definite, and that the Newton step realizes a global minimizer. By studying the classification of manufactured two-dimensional point-cloud data, we demonstrate both an improvement in validation error and a striking qualitative difference in the basis functions encoded in the hidden layer when trained using NGD. Application to image classification benchmarks for both dense and convolutional architectures reveals improved training accuracy, suggesting possible gains of second-order methods over gradient descent.
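
To make the alternation concrete, below is a minimal sketch of the NGD loop in PyTorch, written from the abstract alone. The two-layer tanh network, the random stand-in data, the sizes, the single undamped Newton step per epoch, and the small ridge shift applied to the semi-definite Hessian are all illustrative assumptions, not the authors' implementation.

```python
# Minimal NGD sketch (assumptions, not the paper's code): alternate a Newton
# step on the linear layer with a gradient step on the hidden layers.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d_in, d_feat, n_classes = 256, 2, 16, 3    # illustrative sizes
x = torch.randn(n, d_in)                      # stand-in 2-D point cloud
y = torch.randint(0, n_classes, (n,))         # stand-in class labels

# Hidden layers provide the adaptive basis phi(x); the architecture is arbitrary.
hidden = torch.nn.Sequential(
    torch.nn.Linear(d_in, d_feat), torch.nn.Tanh(),
    torch.nn.Linear(d_feat, d_feat), torch.nn.Tanh(),
)
W = torch.zeros(n_classes, d_feat)            # linear classification layer
opt = torch.optim.SGD(hidden.parameters(), lr=1e-2)

def newton_step(phi, y, W, ridge=1e-6):
    # Cross-entropy is convex in W for a fixed basis phi, so a (regularized)
    # Newton step moves toward the global minimizer. The Hessian size depends
    # only on W (n_classes * d_feat entries), not on the hidden layers.
    loss_fn = lambda w: F.cross_entropy(phi @ w.view(n_classes, d_feat).t(), y)
    w = W.flatten().clone().requires_grad_(True)
    g = torch.autograd.grad(loss_fn(w), w)[0]
    H = torch.autograd.functional.hessian(loss_fn, w.detach())
    H = H + ridge * torch.eye(H.shape[0])     # shift the semi-definite Hessian
    return (w.detach() - torch.linalg.solve(H, g)).view(n_classes, d_feat)

for epoch in range(20):
    with torch.no_grad():
        phi = hidden(x)                       # freeze the basis
    W = newton_step(phi, y, W)                # second-order step on W only
    opt.zero_grad()
    F.cross_entropy(hidden(x) @ W.t(), y).backward()
    opt.step()                                # gradient step on hidden layers
```

Note how this reflects the cost claim in the abstract: the Hessian here is (n_classes * d_feat) x (n_classes * d_feat) regardless of how deep or wide `hidden` is, so the second-order step stays cheap while the hidden layers are trained by ordinary gradient descent.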
