Paper Title
Generalizability of Machine Learning Models: Quantitative Evaluation of Three Methodological Pitfalls
Paper Authors
Paper Abstract
Purpose: Despite the potential of machine learning models, a lack of generalizability has hindered their widespread adoption in clinical practice. We investigated three methodological pitfalls: (1) violation of the independence assumption, (2) model evaluation with an inappropriate performance indicator or baseline for comparison, and (3) batch effect.

Materials and Methods: Using several retrospective datasets, we implemented machine learning models with and without these pitfalls to quantitatively illustrate their effect on model generalizability.

Results: Violating the independence assumption by applying oversampling, feature selection, and data augmentation before splitting the data into training, validation, and test sets led to misleading, superficial gains in F1 score of 71.2% in predicting local recurrence, 5.0% in predicting 3-year overall survival in head and neck cancer, and 46.0% in distinguishing histopathological patterns in lung cancer, respectively. Furthermore, randomly distributing the data points of a single subject across the training, validation, and test sets led to a superficial 21.8% increase in F1 score. We also demonstrated the importance of the choice of performance measure and of the baseline used for comparison. In the presence of a batch effect, a model built for pneumonia detection achieved an F1 score of 98.7%; however, when the same model was applied to a new dataset of normal patients, it correctly classified only 3.86% of the samples.

Conclusions: These methodological pitfalls cannot be captured by internal model evaluation, and the inaccurate predictions made by such models can lead to wrong conclusions and interpretations. Understanding and avoiding these pitfalls is therefore necessary for developing generalizable models.
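The first pitfall arises when a preprocessing step that should only ever see training data (oversampling, feature selection, augmentation) is fitted on the full dataset before the split, or when repeated measurements from one subject end up in different splits. Below is a minimal sketch of the leakage-free ordering, assuming scikit-learn and imbalanced-learn are available; the synthetic arrays and subject IDs are hypothetical and illustrate only the splitting discipline, not the paper's actual pipelines.

import numpy as np
from sklearn.model_selection import train_test_split, GroupShuffleSplit
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # hypothetical features
y = rng.integers(0, 2, size=200)        # hypothetical binary labels
groups = rng.integers(0, 50, size=200)  # hypothetical subject IDs

# Pitfall: SMOTE().fit_resample(X, y) BEFORE the split would place
# synthetic neighbors of test samples into the training set and
# inflate the test-set score.

# Correct order: split first, then oversample the training set only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr_res, y_tr_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

# Correct handling of repeated measurements: keep all data points
# from one subject inside a single split.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

The same discipline applies to feature selection and data augmentation: fit or apply them on the training fold only, then carry the fitted transform over to the validation and test data.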