Paper Title

Representation Bias in Data: A Survey on Identification and Resolution Techniques

Paper Authors

Nima Shahbazi, Yin Lin, Abolfazl Asudeh, H. V. Jagadish

Abstract

Data-driven algorithms are only as good as the data they work with, while data sets, especially social data, often fail to represent minorities adequately. Representation bias in data can happen for various reasons, ranging from historical discrimination to selection and sampling biases in the data acquisition and preparation methods. Given that "bias in, bias out," one cannot expect AI-based solutions to have equitable outcomes for societal applications without addressing issues such as representation bias. While there has been extensive study of fairness in machine learning models, including several review papers, bias in the data has been less studied. This paper reviews the literature on identifying and resolving representation bias as a feature of a data set, independent of how it is consumed later. The scope of this survey is bounded to structured (tabular) and unstructured (e.g., image, text, graph) data. It presents taxonomies to categorize the studied techniques based on multiple design dimensions and provides a side-by-side comparison of their properties. There is still a long way to go to fully address representation bias issues in data. The authors hope that this survey motivates researchers to approach these challenges in the future by observing existing work within their respective domains.
