Paper title
ALdataset: a benchmark for pool-based active learning
Paper authors
Paper abstract
Active learning (AL) is a subfield of machine learning (ML) in which a learning algorithm can achieve good accuracy with fewer training samples by interactively querying a user/oracle to label new data points. Pool-based AL is well motivated in many ML tasks where unlabeled data are abundant but labels are hard to obtain. Although many pool-based AL methods have been developed, the lack of comparative benchmarking and integration of techniques makes it difficult to: 1) determine the current state-of-the-art technique; 2) evaluate the relative benefit of new methods for various properties of the dataset; 3) understand what specific problems merit greater attention; and 4) measure the progress of the field over time. To make comparative evaluation among AL methods easier, we present a benchmark task for pool-based active learning, which consists of benchmarking datasets and quantitative metrics that summarize overall performance. We present experimental results for various active learning strategies, both recently proposed and classic highly cited methods, and draw insights from the results.
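To make the pool-based setting concrete, the following is a minimal sketch of a query loop, not the benchmark's own code. It assumes a toy one-dimensional pool, a hypothetical oracle that labels points by a hidden threshold, a simple midpoint classifier, and uncertainty sampling as the query strategy, with random sampling as the baseline. The mean accuracy over the learning curve is used here only as a stand-in for a "quantitative metric that summarizes overall performance"; the abstract does not specify which metrics the benchmark uses.

```python
import random

THRESHOLD = 0.623  # hypothetical ground truth, hidden from the learner


def oracle(x):
    """Hypothetical oracle: returns the true label of a queried point."""
    return int(x >= THRESHOLD)


def fit_boundary(labeled):
    """Toy classifier: boundary halfway between the two labeled classes."""
    zeros = [x for x, y in labeled if y == 0]
    ones = [x for x, y in labeled if y == 1]
    return (max(zeros) + min(ones)) / 2


def accuracy(boundary, points):
    """Fraction of points whose predicted label matches the oracle."""
    hits = sum(int(x >= boundary) == oracle(x) for x in points)
    return hits / len(points)


def run(pool, test, budget, strategy, rng):
    """Run one pool-based AL experiment and return the learning curve."""
    unlabeled = sorted(pool)
    # Assumed initial labeled set: the two extreme pool points,
    # which guarantees one example of each class in this toy setup.
    seeds = [unlabeled.pop(0), unlabeled.pop()]
    labeled = [(x, oracle(x)) for x in seeds]
    curve = []
    for _ in range(budget):
        boundary = fit_boundary(labeled)
        if strategy == "uncertainty":
            # Query the pool point closest to the current decision boundary.
            query = min(unlabeled, key=lambda x: abs(x - boundary))
        else:
            # Baseline: query a pool point uniformly at random.
            query = rng.choice(unlabeled)
        unlabeled.remove(query)
        labeled.append((query, oracle(query)))
        curve.append(accuracy(fit_boundary(labeled), test))
    return curve


rng = random.Random(0)
pool = [i / 200 for i in range(201)]          # unlabeled pool
test = [(i + 0.5) / 500 for i in range(500)]  # held-out evaluation points
budget = 10                                   # number of oracle queries

active_curve = run(pool, test, budget, "uncertainty", rng)
random_curve = run(pool, test, budget, "random", rng)

# Summary metric (an assumption for illustration): mean accuracy over queries.
print("uncertainty:", sum(active_curve) / budget)
print("random:     ", sum(random_curve) / budget)
```

In this toy problem, uncertainty sampling behaves like a binary search for the hidden threshold, which is why it needs only a handful of labels. Real benchmarks would replace the pool, classifier, and query strategy with actual datasets and methods.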