Paper Title

The choice of scaling technique matters for classification performance

Authors

Lucas B. V. de Amorim, George D. C. Cavalcanti, Rafael M. O. Cruz

Abstract

Dataset scaling, also known as normalization, is an essential preprocessing step in a machine learning pipeline. It is aimed at adjusting attribute scales so that they all vary within the same range. This transformation is known to improve the performance of classification models, but there are several scaling techniques to choose from, and this choice is generally not made carefully. In this paper, we execute a broad experiment comparing the impact of 5 scaling techniques on the performance of 20 classification algorithms, spanning monolithic and ensemble models, applied to 82 publicly available datasets with varying imbalance ratios. Results show that the choice of scaling technique matters for classification performance, and that the performance difference between the best and the worst scaling technique is relevant and statistically significant in most cases. They also indicate that choosing an inadequate technique can be more detrimental to classification performance than not scaling the data at all. We also show how the performance variation of an ensemble model across different scaling techniques tends to be dictated by that of its base model. Finally, we discuss the relationship between a model's sensitivity to the choice of scaling technique and its performance, and provide insights into its applicability in different model deployment scenarios. Full results and source code for the experiments in this paper are available in a GitHub repository: https://github.com/amorimlb/scaling_matters
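The abstract's central claim, that swapping the scaler can meaningfully change a classifier's results, is easy to probe on a small scale. Below is a minimal sketch, not the paper's protocol: it assumes scikit-learn, uses five common scalers as illustrative stand-ins (the paper's exact five techniques, models, and 82 datasets are documented in the linked repository), and compares cross-validated accuracy of one scale-sensitive classifier with and without scaling.

```python
# Minimal sketch (assumed setup, not the paper's experimental protocol):
# compare the effect of several scaling techniques on one classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    MaxAbsScaler,
    MinMaxScaler,
    QuantileTransformer,
    RobustScaler,
    StandardScaler,
)

X, y = load_breast_cancer(return_X_y=True)

# Illustrative stand-ins for "5 scaling techniques", plus a no-scaling baseline.
scalers = {
    "none": None,
    "standard": StandardScaler(),
    "min-max": MinMaxScaler(),
    "max-abs": MaxAbsScaler(),
    "robust": RobustScaler(),
    "quantile": QuantileTransformer(n_quantiles=100, random_state=0),
}

for name, scaler in scalers.items():
    clf = KNeighborsClassifier()  # distance-based, hence scale-sensitive
    model = clf if scaler is None else make_pipeline(scaler, clf)
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:>8}: mean accuracy = {scores.mean():.3f}")
```

Putting the scaler inside a `Pipeline` matters: it fits the scaler on each training fold only, avoiding leakage of test-fold statistics into the transformation, which mirrors how scaling must be handled in any honest comparison of techniques.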
