Paper Title
Delving into Semantic Scale Imbalance
Paper Authors
Paper Abstract
Model bias triggered by long-tailed data has been widely studied. However, measures based on the number of samples cannot explain three phenomena simultaneously: (1) Given sufficient data, the classification performance gain from additional samples is marginal. (2) When data are insufficient, classification performance decays precipitously as the number of training samples decreases. (3) Models trained on sample-balanced datasets still exhibit different biases toward different classes. In this work, we define and quantify the semantic scale of a class, which measures its feature diversity. Excitingly, we experimentally find a marginal effect of semantic scale, which perfectly describes the first two phenomena. Furthermore, we propose a quantitative measure of semantic scale imbalance that accurately reflects model bias on multiple datasets, even on sample-balanced data, revealing a novel perspective for the study of class imbalance. Because semantic scale imbalance is prevalent, we propose semantic-scale-balanced learning, including a general loss-improvement scheme and a dynamic re-weighting training framework that overcomes the challenge of computing semantic scales in real time during iterations. Comprehensive experiments show that dynamic semantic-scale-balanced learning consistently enables models to perform well on large-scale long-tailed and non-long-tailed natural and medical datasets, making it a good starting point for mitigating this prevalent but previously unnoticed model bias.
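To make the core quantity concrete, below is a minimal sketch of a per-class "semantic scale" computed as a log-determinant volume of the class's feature matrix, a common proxy for feature diversity. This is an illustrative assumption in the spirit of coding-rate-style diversity measures, not the paper's verbatim definition; the function name `semantic_scale` and the `eps` regularizer are hypothetical.

```python
import numpy as np

def semantic_scale(features: np.ndarray, eps: float = 1e-4) -> float:
    """Sketch of a per-class semantic scale (feature diversity).

    features: (m, d) array of m feature vectors of dimension d for one class.
    Returns a log-det volume of the space spanned by the features; larger
    values indicate more diverse class features. Illustrative only.
    """
    m, d = features.shape
    # Gram matrix of the class features; eps guards against rank deficiency.
    gram = features @ features.T / m
    _, logdet = np.linalg.slogdet(np.eye(m) + (d / (m * eps)) * gram)
    return 0.5 * logdet

# Per-class scales could then drive re-weighting during training,
# e.g. assigning larger loss weights to classes with smaller scales.
```

A re-weighting scheme built on such a measure would naturally extend to sample-balanced datasets, since classes with equal sample counts can still differ in feature diversity, which matches phenomenon (3) in the abstract.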