Paper title
Evaluating and Crafting Datasets Effective for Deep Learning With Data Maps
Paper authors
Paper abstract
Rapid progress in deep learning model construction has increased the need for appropriate training data. The popularity of large datasets, sometimes called "big data", has diverted attention from assessing their quality. Training on large datasets often requires excessive system resources and an infeasible amount of time. Furthermore, supervised machine learning has yet to be fully automated: the larger the dataset, the more time must be spent labeling samples manually. We propose a method for curating smaller datasets that achieve comparable out-of-distribution model accuracy. After an initial training session, samples are classified by how difficult they are for the model to learn, and the smaller dataset is built from an appropriate distribution of those difficulty classes.
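The abstract does not say how sample difficulty is measured or which distribution of difficulty classes is used. As a rough illustration only, the sketch below assumes a common data-map convention: for each sample, record the probability the model assigns to the correct label at every epoch of the initial training session, then take the mean (confidence) and the standard deviation (variability). The function names, the thresholds 0.5 and 0.2, the three class names, and the per-class quotas are all hypothetical choices made for this example, not values taken from the paper.

```python
from statistics import mean, pstdev


def data_map_stats(true_label_probs):
    """Return (confidence, variability) for one sample.

    true_label_probs: probability assigned to the correct label at each epoch.
    """
    return mean(true_label_probs), pstdev(true_label_probs)


def difficulty_class(confidence, variability,
                     conf_threshold=0.5, var_threshold=0.2):
    """Assign a difficulty class; both thresholds are illustrative assumptions."""
    if variability >= var_threshold:
        return "ambiguous"
    return "easy" if confidence >= conf_threshold else "hard"


def curate(per_sample_probs, quota):
    """Group samples by difficulty, then keep at most quota[class] per class.

    per_sample_probs: dict mapping sample id -> list of per-epoch probabilities.
    quota: dict mapping class name -> maximum number of samples to keep.
    """
    buckets = {"easy": [], "ambiguous": [], "hard": []}
    for sample_id, probs in per_sample_probs.items():
        confidence, variability = data_map_stats(probs)
        buckets[difficulty_class(confidence, variability)].append(sample_id)

    subset = []
    for name, ids in buckets.items():
        # Sorted order keeps the example deterministic; random sampling
        # within each class would be an equally plausible choice.
        subset.extend(sorted(ids)[: quota.get(name, 0)])
    return buckets, subset


# Hypothetical training history for four samples over three epochs.
history = {
    "a": [0.90, 0.95, 0.97],
    "b": [0.10, 0.15, 0.12],
    "c": [0.20, 0.80, 0.50],
    "d": [0.85, 0.90, 0.88],
}
buckets, subset = curate(history, {"easy": 1, "ambiguous": 1, "hard": 1})
print(buckets)
print(subset)
```

Changing the quotas changes the mix of easy, ambiguous, and hard samples in the curated subset, which is the kind of knob the abstract's "appropriate distribution" refers to.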