Paper Title

Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data

Paper Authors

Colin Wei, Kendrick Shen, Yining Chen, Tengyu Ma

Paper Abstract

Self-training algorithms, which train a model to fit pseudolabels predicted by another previously-learned model, have been very successful for learning with unlabeled data using neural networks. However, the current theoretical understanding of self-training only applies to linear models. This work provides a unified theoretical analysis of self-training with deep networks for semi-supervised learning, unsupervised domain adaptation, and unsupervised learning. At the core of our analysis is a simple but realistic "expansion" assumption, which states that a low probability subset of the data must expand to a neighborhood with large probability relative to the subset. We also assume that neighborhoods of examples in different classes have minimal overlap. We prove that under these assumptions, the minimizers of population objectives based on self-training and input-consistency regularization will achieve high accuracy with respect to ground-truth labels. By using off-the-shelf generalization bounds, we immediately convert this result to sample complexity guarantees for neural nets that are polynomial in the margin and Lipschitzness. Our results help explain the empirical successes of recently proposed self-training algorithms which use input consistency regularization.
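To make the setup concrete, below is a minimal toy sketch of the two ingredients the abstract names: fitting pseudolabels produced by a previously-learned teacher, and an input-consistency penalty that encourages matching predictions on an example and a perturbed neighbor. This is an illustrative NumPy example on 1-D data, not the paper's actual algorithm or analysis; the teacher weight, perturbation scale, and penalty weight are all made-up choices.

```python
# Hypothetical sketch: self-training + input-consistency regularization
# on toy 1-D two-cluster data. Not the paper's method; for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Unlabeled data: two well-separated clusters; ground truth is sign(x).
x = np.concatenate([rng.normal(-2, 0.3, 100), rng.normal(2, 0.3, 100)])
y_true = (x > 0).astype(float)

# "Teacher": a weak previously-learned model whose predictions we freeze
# into pseudolabels for the unlabeled data.
teacher_w = 0.5
pseudolabels = (sigmoid(teacher_w * x) > 0.5).astype(float)

# "Student": trained to fit the pseudolabels (cross-entropy) plus a
# consistency penalty between x and a perturbed neighbor x_aug.
w, lam, lr = 0.0, 1.0, 0.1
for _ in range(200):
    p = sigmoid(w * x)
    x_aug = x + rng.normal(0, 0.1, x.shape)   # neighborhood perturbation
    p_aug = sigmoid(w * x_aug)
    # Gradient of mean cross-entropy to pseudolabels w.r.t. w.
    grad_ce = np.mean((p - pseudolabels) * x)
    # Gradient of lam * mean((p - p_aug)^2) w.r.t. w.
    grad_cons = lam * np.mean(
        2 * (p - p_aug) * (p * (1 - p) * x - p_aug * (1 - p_aug) * x_aug)
    )
    w -= lr * (grad_ce + grad_cons)

accuracy = np.mean((sigmoid(w * x) > 0.5) == y_true)
```

The separation of the clusters plays the role of the abstract's assumptions: perturbed neighborhoods of the two classes barely overlap, so a student that fits the pseudolabels while staying consistent on neighborhoods recovers the ground-truth labels with high accuracy.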
