Paper Title
Demoting Racial Bias in Hate Speech Detection
Paper Authors
Paper Abstract
In current hate speech datasets, there exists a high correlation between annotators' perceptions of toxicity and signals of African American English (AAE). This bias in annotated training data and the tendency of machine learning models to amplify it cause AAE text to often be mislabeled as abusive/offensive/hate speech with a high false positive rate by current hate speech classifiers. In this paper, we use adversarial training to mitigate this bias, introducing a hate speech classifier that learns to detect toxic sentences while demoting confounds corresponding to AAE texts. Experimental results on a hate speech dataset and an AAE dataset suggest that our method is able to substantially reduce the false positive rate for AAE text while only minimally affecting the performance of hate speech classification.
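The adversarial training the abstract describes can be sketched with a gradient-reversal objective, a common instantiation of confound demotion: a shared encoder is trained to predict toxicity while *unlearning* the dialect signal that an adversary head tries to recover. This is only an illustrative sketch, not the paper's implementation — the linear encoder, the synthetic two-feature data (one feature standing in for toxicity, one for the AAE confound), and the hyperparameters `lr` and `lam` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: feature 0 tracks toxicity, feature 1 tracks the dialect
# confound (a hypothetical stand-in for AAE signals); the two labels
# are sampled independently.
n = 200
tox = rng.integers(0, 2, n).astype(float)    # toxicity labels
dial = rng.integers(0, 2, n).astype(float)   # dialect labels (confound)
X = np.stack([tox + 0.3 * rng.normal(size=n),
              dial + 0.3 * rng.normal(size=n)], axis=1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = rng.normal(scale=0.1, size=(2, 2))   # shared encoder
w_t = rng.normal(scale=0.1, size=2)      # toxicity head
w_d = rng.normal(scale=0.1, size=2)      # adversarial dialect head
lr, lam = 0.1, 1.0                       # learning rate, reversal strength

for _ in range(500):
    H = X @ W                            # shared representation
    g_t = (sigmoid(H @ w_t) - tox) / n   # d(BCE)/d(logit) for toxicity
    g_d = (sigmoid(H @ w_d) - dial) / n  # d(BCE)/d(logit) for dialect

    # Both heads descend their own loss.
    w_t_new = w_t - lr * (H.T @ g_t)
    w_d_new = w_d - lr * (H.T @ g_d)

    # The encoder descends the toxicity loss but ASCENDS the dialect
    # loss (gradient reversal), demoting the confound from H.
    W -= lr * (X.T @ np.outer(g_t, w_t) - lam * X.T @ np.outer(g_d, w_d))
    w_t, w_d = w_t_new, w_d_new

H = X @ W
tox_acc = np.mean((sigmoid(H @ w_t) > 0.5) == (tox > 0.5))
dial_acc = np.mean((sigmoid(H @ w_d) > 0.5) == (dial > 0.5))
```

After training, toxicity accuracy stays high while the adversary's dialect accuracy is pushed down, mirroring the abstract's claim of a reduced false positive rate for AAE text with minimal cost to hate speech classification.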