Paper Title
Cross-validation on Extreme Regions
Authors
Abstract
We conduct a non-asymptotic study of the cross-validation (CV) estimate of the generalization risk for learning algorithms dedicated to extreme regions of the covariate space. In this extreme value analysis context, the risk function measures the algorithm's error given that the norm of the input exceeds a high quantile. The main challenges in this framework are the negligible size of the extreme training sample relative to the full sample size and the need to rescale the risk function by a probability tending to zero. We open the road to a finite-sample understanding of CV for extreme values by establishing two new results: an exponential probability bound on the K-fold CV error and a polynomial probability bound on the leave-p-out CV. Our bounds are sharp in the sense that they match state-of-the-art guarantees for standard CV estimates while extending them to encompass a conditioning event of small probability. We illustrate the significance of our results for high-dimensional classification in extreme regions via a Lasso-type logistic regression algorithm. The tightness of our bounds is investigated in numerical experiments.
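As a rough illustration of the quantity studied, the sketch below computes a K-fold CV estimate of a risk restricted to inputs whose norm exceeds a high empirical quantile. It is a minimal sketch under stated assumptions, not the paper's procedure. The function names (`extreme_kfold_cv`, `fit_centroid`, `zero_one`), the choice of an empirical training-fold quantile as threshold, and the rescaling by `alpha * n_val` are all illustrative assumptions. A nearest-centroid rule on the angular component stands in for the Lasso-type logistic regression mentioned in the abstract.

```python
import numpy as np

def extreme_kfold_cv(X, y, fit, loss, alpha=0.1, n_folds=5, seed=0):
    """K-fold CV estimate of the risk given that ||X|| exceeds its (1 - alpha) quantile."""
    rng = np.random.default_rng(seed)
    n = len(y)
    norms = np.linalg.norm(X, axis=1)
    folds = np.array_split(rng.permutation(n), n_folds)
    fold_risks = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), val_idx)
        # Assumption: the threshold is the empirical quantile of the training norms.
        t = np.quantile(norms[train_idx], 1 - alpha)
        tr = train_idx[norms[train_idx] > t]   # extreme training points
        va = val_idx[norms[val_idx] > t]       # extreme validation points
        predict = fit(X[tr], y[tr])
        # Rescale the summed loss by alpha * n_val, the expected number of
        # extreme points in the validation fold (the small conditioning probability).
        total = loss(predict(X[va]), y[va]).sum()
        fold_risks.append(total / (alpha * len(val_idx)))
    return float(np.mean(fold_risks))

def fit_centroid(X, y):
    """Hypothetical stand-in classifier: nearest class centroid of X / ||X||."""
    angles = X / np.linalg.norm(X, axis=1, keepdims=True)
    classes = np.unique(y)
    centers = np.stack([angles[y == c].mean(axis=0) for c in classes])
    def predict(X_new):
        a = X_new / np.linalg.norm(X_new, axis=1, keepdims=True)
        dist = ((a[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return classes[dist.argmin(axis=1)]
    return predict

def zero_one(pred, target):
    """0-1 classification loss, one value per point."""
    return (pred != target).astype(float)

# Synthetic heavy-tailed data: Pareto radius, label determined by the angle.
rng = np.random.default_rng(1)
n = 2000
angle = rng.uniform(0.0, np.pi / 2, size=n)
radius = 1.0 + rng.pareto(2.0, size=n)
X = radius[:, None] * np.column_stack([np.cos(angle), np.sin(angle)])
y = (angle > np.pi / 4).astype(int)

risk = extreme_kfold_cv(X, y, fit_centroid, zero_one, alpha=0.1, n_folds=5)
print(f"K-fold CV estimate of the extreme-region risk: {risk:.3f}")
```

Only about a fraction `alpha` of each fold enters the fit and the evaluation. This is the small effective sample size that the abstract identifies as the main difficulty.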