Paper Title
Data Smells in Public Datasets
Paper Authors
Paper Abstract
The adoption of Artificial Intelligence (AI) in high-stakes domains such as healthcare, wildlife preservation, autonomous driving, and the criminal justice system calls for a data-centric approach to AI. Data scientists spend the majority of their time studying and wrangling data, yet tools to aid them with data analysis are lacking. This study identifies recurrent data quality issues in public datasets. Analogous to code smells, we introduce a novel catalogue of data smells that can be used to indicate early signs of problems or technical debt in machine learning systems. To understand the prevalence of data quality issues in datasets, we analyse 25 public datasets and identify 14 data smells.