The Effects of Data Quality on Machine Learning Performance on Tabular Data

Paper Title

The Effects of Data Quality on Machine Learning Performance on Tabular Data

Authors

Sedir Mohammed, Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Noack, Hendrik Patzlaff, Felix Naumann, Hazar Harmouch

Abstract

Modern artificial intelligence (AI) applications require large quantities of training and test data. This need creates critical challenges not only concerning the availability of such data, but also regarding its quality. For example, incomplete, erroneous, or inappropriate training data can lead to unreliable models that produce ultimately poor decisions. Trustworthy AI applications require high-quality training and test data along many quality dimensions, such as accuracy, completeness, and consistency. We explore empirically the relationship between six data quality dimensions and the performance of 19 popular machine learning algorithms covering the tasks of classification, regression, and clustering, with the goal of explaining their performance in terms of data quality. Our experiments distinguish three scenarios based on the AI pipeline steps that were fed with polluted data: polluted training data, test data, or both. We conclude the paper with an extensive discussion of our observations.
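
The three scenarios named in the abstract (polluted training data, polluted test data, or both) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the helper names `pollute_completeness` and `build_scenarios`, the choice of completeness as the polluted dimension, the mechanism of replacing randomly chosen feature cells with `None`, and the 25% pollution fraction.

```python
import random


def pollute_completeness(rows, fraction, seed=0):
    """Return a copy of `rows` with round(fraction * n_cells) cells set to None.

    Hypothetical helper: models a completeness defect as missing feature values.
    """
    rng = random.Random(seed)
    polluted = [list(row) for row in rows]  # copy so the clean data stays intact
    cells = [(i, j) for i, row in enumerate(polluted) for j in range(len(row))]
    n_missing = round(fraction * len(cells))
    for i, j in rng.sample(cells, n_missing):
        polluted[i][j] = None
    return polluted


def build_scenarios(train, test, fraction, seed=0):
    """Build (train, test) pairs for the three pollution scenarios."""
    bad_train = pollute_completeness(train, fraction, seed)
    bad_test = pollute_completeness(test, fraction, seed + 1)
    return {
        "polluted_train": (bad_train, test),
        "polluted_test": (train, bad_test),
        "polluted_both": (bad_train, bad_test),
    }


# Toy feature tables; a real experiment would use actual tabular datasets.
train = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
test = [[9.0, 10.0], [11.0, 12.0]]
scenarios = build_scenarios(train, test, fraction=0.25)
```

Each entry of `scenarios` would then be passed to the same learning algorithm, so that any difference in performance can be attributed to where the pollution entered the pipeline.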
