Paper Title

Revisiting Data Complexity Metrics Based on Morphology for Overlap and Imbalance: Snapshot, New Overlap Number of Balls Metrics and Singular Problems Prospect

Authors

Pascual-Triana, José Daniel; Charte, David; Arroyo, Marta Andrés; Fernández, Alberto; Herrera, Francisco

Abstract

Data Science and Machine Learning have become fundamental assets for companies and research institutions alike. As one of their fields, supervised classification allows for class prediction of new samples by learning from given training data. However, some properties can make datasets problematic to classify. In order to evaluate a dataset a priori, data complexity metrics have been used extensively. They provide information regarding different intrinsic characteristics of the data, which serves to evaluate classifier compatibility and to choose a course of action that improves performance. However, most complexity metrics focus on just one characteristic of the data, which can be insufficient to properly evaluate the dataset with respect to the classifiers' performance. In fact, class overlap, a very detrimental feature for the classification process (especially when imbalance among class labels is also present), is hard to assess. This research work focuses on revisiting complexity metrics based on data morphology. In accordance with their nature, the premise is that they provide both good estimates of class overlap and strong correlations with classification performance. For that purpose, a novel family of metrics has been developed. Being based on ball coverage by classes, they are named Overlap Number of Balls. Finally, some prospects for adapting this family of metrics to singular (more complex) problems are discussed.
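
The abstract does not spell out how the ball coverage is computed, so the following is only a rough illustrative sketch of the general idea, not the authors' implementation. It assumes Euclidean distance, open balls centred on training points with radius equal to the distance to the nearest point of another class, a greedy choice of balls, and at least two classes in the data. The function names `greedy_ball_cover` and `overlap_ratios`, and the two ratios returned, are hypothetical choices made for this example.

```python
import numpy as np


def greedy_ball_cover(X, y):
    """Count the balls a greedy single-class cover needs for each class.

    Assumption: each candidate ball is centred on a training point and is
    open, with radius equal to the distance to the nearest point of any
    other class, so it never contains a point of another class.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    balls = {}
    for label in np.unique(y):
        same = np.flatnonzero(y == label)
        other = np.flatnonzero(y != label)
        radius = dist[np.ix_(same, other)].min(axis=1)
        # covers[i, j] is True when the ball centred at same[i] contains same[j].
        covers = dist[np.ix_(same, same)] < radius[:, None]
        # A centre always covers itself, even if its radius is zero.
        np.fill_diagonal(covers, True)
        uncovered = np.ones(len(same), dtype=bool)
        count = 0
        while uncovered.any():
            # Pick the ball that covers the most still-uncovered points.
            best = (covers & uncovered).sum(axis=1).argmax()
            uncovered &= ~covers[best]
            count += 1
        balls[label.item()] = count
    return balls


def overlap_ratios(X, y):
    """Return (balls / points overall, mean of per-class balls / points)."""
    y = np.asarray(y)
    balls = greedy_ball_cover(X, y)
    labels, counts = np.unique(y, return_counts=True)
    total = sum(balls.values()) / len(y)
    per_class = np.mean([balls[l.item()] / c for l, c in zip(labels, counts)])
    return total, float(per_class)
```

Under these assumptions, well-separated classes need few balls per point (ratios near zero), while heavily interleaved classes need close to one ball per point (ratios near one). Averaging the ratio per class, rather than over all points, keeps a small minority class from being drowned out when the labels are imbalanced.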
