Paper Title
Underspecification Presents Challenges for Credibility in Modern Machine Learning
Paper Authors
Alexander D'Amour et al.
Paper Abstract
ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.
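To make the abstract's central claim concrete, here is a minimal synthetic sketch of underspecification. It is not the paper's code or experiments: the data-generating setup, feature names, and model configuration are all illustrative assumptions, and it relies on NumPy and scikit-learn. Several identically configured networks, differing only in random seed, reach near-identical held-out accuracy in the training domain, yet their scores can spread apart in a shifted deployment domain where a spurious feature stops tracking the label.

```python
# Illustrative sketch of underspecification (hypothetical setup, not from the paper).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def make_domain(n, spurious_tracks_label):
    """Two-feature binary task: `core` is moderately predictive in every
    domain; `spur` tracks the label closely in the training domain only."""
    y = rng.integers(0, 2, size=n)
    core = y + rng.normal(0.0, 1.0, size=n)
    if spurious_tracks_label:
        spur = y + rng.normal(0.0, 0.1, size=n)   # spurious shortcut available
    else:
        spur = rng.normal(0.0, 1.0, size=n)       # shortcut broken at deployment
    return np.column_stack([core, spur]), y

X_train, y_train = make_domain(5000, spurious_tracks_label=True)
X_held, y_held = make_domain(2000, spurious_tracks_label=True)    # held out, training domain
X_dep, y_dep = make_domain(2000, spurious_tracks_label=False)     # deployment domain

# Identical pipeline, varying only the random seed: held-out scores cluster
# tightly, while deployment scores can diverge across seeds.
for seed in range(5):
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=seed)
    clf.fit(X_train, y_train)
    print(f"seed={seed}: held-out={clf.score(X_held, y_held):.3f}, "
          f"deployment={clf.score(X_dep, y_dep):.3f}")
```

The design point the sketch isolates: because training is non-convex, each seed selects a different predictor from the set of solutions the pipeline treats as equivalent on held-out training-domain data, and those predictors need not agree once the training and deployment distributions differ.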