ALdataset: a benchmark for pool-based active learning

Paper title

ALdataset: a benchmark for pool-based active learning

Authors

Zhan, Xueying; Chan, Antoni Bert

Abstract

Active learning (AL) is a subfield of machine learning (ML) in which a learning algorithm can achieve good accuracy with fewer training samples by interactively querying a user/oracle to label new data points. Pool-based AL is well-motivated in many ML tasks where unlabeled data is abundant but labels are hard to obtain. Although many pool-based AL methods have been developed, the lack of comparative benchmarking and integration of techniques makes it difficult to: 1) determine the current state-of-the-art technique; 2) evaluate the relative benefit of new methods for various properties of the dataset; 3) understand what specific problems merit greater attention; and 4) measure the progress of the field over time. To make comparative evaluation among AL methods easier, we present a benchmark task for pool-based active learning, which consists of benchmarking datasets and quantitative metrics that summarize overall performance. We present experimental results for various active learning strategies, both recently proposed and classic highly cited methods, and draw insights from the results.
