Data Models for Dataset Drift Controls in Machine Learning With Optical Images

Paper Title

Data Models for Dataset Drift Controls in Machine Learning With Optical Images

Authors

Luis Oala, Marco Aversa, Gabriel Nobis, Kurt Willis, Yoan Neuenschwander, Michèle Buck, Christian Matek, Jerome Extermann, Enrico Pomarico, Wojciech Samek, Roderick Murray-Smith, Christoph Clausen, Bruno Sanguinetti

Abstract

Camera images are ubiquitous in machine learning research. They also play a central role in the delivery of important services spanning medicine and environmental surveying. However, the application of machine learning models in these domains has been limited because of robustness concerns. A primary failure mode is a drop in performance caused by differences between the training data and the deployment data. While there are methods to prospectively validate the robustness of machine learning models to such dataset drifts, existing approaches do not account for explicit models of the primary object of interest: the data. This limits our ability to study and understand the relationship between data generation and downstream machine learning model performance in a physically accurate manner. In this study, we demonstrate how to overcome this limitation by pairing traditional machine learning with physical optics to obtain explicit and differentiable data models. We demonstrate how such data models can be constructed for image data and used to control downstream machine learning model performance related to dataset drift. The findings are distilled into three applications. First, drift synthesis enables the controlled generation of physically faithful drift test cases to power model selection and targeted generalization. Second, the gradient connection between the machine learning task model and the data model allows advanced, precise tolerancing of task model sensitivity to changes in the data generation. These drift forensics can be used to precisely specify the acceptable data environments in which a task model may be run. Third, drift optimization opens up the possibility of creating drifts that help the task model learn better and faster, effectively optimizing the data-generating process itself. A guide to access the open code and datasets is available at https://github.com/aiaudit-org/raw2logit.
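
To make the idea of a gradient connection between a data model and a task model concrete, the following is a minimal toy sketch in plain Python. It is not the code from the raw2logit repository and does not reproduce the paper's optics or processing pipeline. The functions `data_model`, `task_model`, `loss`, and `loss_grad`, the gain and gamma parameters, the weights `w` and `b`, and all numeric values are assumptions chosen for illustration only. The sketch treats a raw image as a list of positive intensities, processes it with a differentiable gain and gamma step, and computes how the task loss changes with those processing parameters.

```python
import math

def data_model(raw, gain, gamma):
    """Hypothetical data model: digital gain followed by gamma correction."""
    return [(gain * x) ** (1.0 / gamma) for x in raw]

def task_model(image, w, b):
    """Hypothetical task model: logistic score on the mean pixel intensity."""
    mean = sum(image) / len(image)
    return 1.0 / (1.0 + math.exp(-(w * mean + b)))

def loss(raw, label, gain, gamma, w=4.0, b=-2.0):
    """Binary cross-entropy of the task model on the processed image."""
    p = task_model(data_model(raw, gain, gamma), w, b)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def loss_grad(raw, label, gain, gamma, w=4.0, b=-2.0):
    """Analytic gradient of the loss with respect to (gain, gamma)."""
    image = data_model(raw, gain, gamma)
    p = task_model(image, w, b)
    n = len(raw)
    # For sigmoid + cross-entropy, dLoss/dlogit = p - label; dlogit/dmean = w.
    dloss_dmean = (p - label) * w
    # y = (gain * x) ** (1 / gamma)  =>  dy/dgain = y / (gamma * gain)
    dmean_dgain = sum(y / (gamma * gain) for y in image) / n
    # dy/dgamma = -y * ln(gain * x) / gamma ** 2   (requires gain * x > 0)
    dmean_dgamma = sum(
        -y * math.log(gain * x) / gamma ** 2 for x, y in zip(raw, image)
    ) / n
    return dloss_dmean * dmean_dgain, dloss_dmean * dmean_dgamma

raw = [0.1, 0.2, 0.4, 0.8]  # toy "raw sensor" intensities, assumed positive
label = 1

# Drift synthesis (toy): evaluate the task loss under controlled parameter drifts.
for gain in (0.5, 1.0, 1.5):
    print("gain", gain, "loss", loss(raw, label, gain, 2.2))

# Drift forensics (toy): sensitivity of the loss to each data-model parameter.
g_gain, g_gamma = loss_grad(raw, label, 1.0, 2.2)
print("dLoss/dgain", g_gain, "dLoss/dgamma", g_gamma)

# Drift optimization (toy): one gradient step on the data-model parameter.
new_gain = 1.0 - 0.1 * g_gain
print("loss after step", loss(raw, label, new_gain, 2.2))
```

In this toy setting the gradients are derived by hand so that the example has no dependencies. A realistic data model with many processing stages would more plausibly rely on an automatic differentiation framework rather than manual derivatives.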
