Paper title
Uncovering bias in the PlantVillage dataset
Paper authors
Paper abstract
We report an investigation into the use of the popular PlantVillage dataset for training deep-learning-based plant disease detection models. We trained a machine learning model using only 8 pixels from the PlantVillage image backgrounds. The model achieved 49.0% accuracy on the held-out test set, well above the random guessing accuracy of 2.6%. This result indicates that the PlantVillage dataset contains noise correlated with the labels, and that deep learning models can easily exploit this bias to make predictions. Possible approaches to alleviate this problem are discussed.
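The background-pixel experiment described in the abstract can be illustrated with a small sketch. The abstract does not say which 8 pixels were used or which model was trained, so everything here is an assumption made for illustration: the pixel positions (four corners plus four edge midpoints), the nearest-centroid classifier, and the synthetic images whose background brightness depends on the label. The function names are hypothetical, and the code uses only the Python standard library so that it runs without extra dependencies.

```python
import random


def border_pixel_coords(height, width):
    """Return 8 (row, col) positions: 4 corners and 4 edge midpoints.

    This choice of positions is an assumption; the abstract only says "8 pixels".
    """
    bottom, right = height - 1, width - 1
    mid_r, mid_c = height // 2, width // 2
    return [
        (0, 0), (0, mid_c), (0, right),
        (mid_r, 0), (mid_r, right),
        (bottom, 0), (bottom, mid_c), (bottom, right),
    ]


def extract_features(image):
    """Flatten the RGB values of the 8 border pixels into 24 numbers."""
    height, width = len(image), len(image[0])
    features = []
    for r, c in border_pixel_coords(height, width):
        features.extend(image[r][c])
    return features


def make_synthetic_image(label, rng, size=16):
    """Hypothetical stand-in data: the background encodes the label,
    while the central "leaf" region is pure noise."""
    base = 40 + 50 * label
    lo, hi = size // 4, 3 * size // 4
    image = []
    for r in range(size):
        row = []
        for c in range(size):
            if lo <= r < hi and lo <= c < hi:
                pixel = tuple(rng.randrange(256) for _ in range(3))
            else:
                pixel = tuple(base + rng.randint(-20, 20) for _ in range(3))
            row.append(pixel)
        image.append(row)
    return image


def fit_centroids(features, labels):
    """Compute the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared distance)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], x)),
    )


def run_experiment(n_classes=4, per_class=30, seed=0):
    """Train on border pixels only; return (test accuracy, chance level)."""
    rng = random.Random(seed)
    data = [
        (extract_features(make_synthetic_image(label, rng)), label)
        for label in range(n_classes)
        for _ in range(per_class)
    ]
    rng.shuffle(data)
    split = int(0.8 * len(data))
    train, test = data[:split], data[split:]
    centroids = fit_centroids([x for x, _ in train], [y for _, y in train])
    correct = sum(predict(centroids, x) == y for x, y in test)
    return correct / len(test), 1.0 / n_classes


if __name__ == "__main__":
    accuracy, chance = run_experiment()
    print(f"border-pixel accuracy: {accuracy:.2f} (chance: {chance:.2f})")
```

If a classifier that never sees the leaf scores well above chance, the background alone carries label information. That is the kind of leakage the abstract reports. The 2.6% chance level quoted there would correspond to about 38 equally likely classes (1/38 ≈ 0.026).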