Paper Title
SUOD: Toward Scalable Unsupervised Outlier Detection
Paper Authors
Paper Abstract
Outlier detection is a key field of machine learning for identifying abnormal data objects. Because acquiring ground truth is expensive, unsupervised models are often chosen in practice. To compensate for the unstable nature of unsupervised algorithms, practitioners in high-stakes fields such as finance, health, and security prefer to build a large number of models for further combination and analysis. However, this poses scalability challenges on large, high-dimensional datasets. In this study, we propose a three-module acceleration framework called SUOD to expedite training and prediction with a large number of unsupervised detection models. SUOD's Random Projection module generates lower-dimensional subspaces for high-dimensional datasets while preserving their distance relationships. The Balanced Parallel Scheduling module forecasts the training and prediction cost of models with high confidence, so that the task scheduler can assign a nearly equal task load to each worker for efficient parallelization. SUOD also comes with a Pseudo-supervised Approximation module, which approximates fitted unsupervised models with supervised regressors of lower time complexity for fast prediction on unseen data. It may be considered a knowledge distillation process for unsupervised models. Notably, all three modules are independent and can be freely mixed and matched; a combination of modules can be chosen based on the use case. Extensive experiments on more than 30 benchmark datasets demonstrate the efficacy of SUOD, and a comprehensive future development plan is also presented.
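The load-balancing idea behind the Balanced Parallel Scheduling module can be illustrated with a minimal sketch. The abstract does not specify how the scheduler works, so the code below substitutes a standard greedy heuristic (longest-processing-time-first) and should not be read as SUOD's actual algorithm. It assumes the per-model costs have already been forecast; the function name `balanced_schedule` and the cost values are hypothetical and chosen only for illustration.

```python
import heapq


def balanced_schedule(predicted_costs, n_workers):
    """Assign tasks to workers so that summed predicted costs are roughly equal.

    Uses the greedy longest-processing-time-first heuristic, which is an
    assumption made for this sketch, not the scheduler described in the paper.
    Returns (assignment, loads): task indices per worker and each worker's load.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")

    # Min-heap of (current load, worker index): the lightest worker is on top.
    heap = [(0.0, worker) for worker in range(n_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]

    # Visit tasks from most to least expensive.
    order = sorted(
        range(len(predicted_costs)),
        key=lambda i: predicted_costs[i],
        reverse=True,
    )
    for task in order:
        # Give the next task to whichever worker currently has the least load.
        load, worker = heapq.heappop(heap)
        assignment[worker].append(task)
        heapq.heappush(heap, (load + predicted_costs[task], worker))

    loads = [sum(predicted_costs[t] for t in tasks) for tasks in assignment]
    return assignment, loads


# Hypothetical forecast costs (arbitrary units) for eight detection models.
costs = [9.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
assignment, loads = balanced_schedule(costs, n_workers=3)
print(assignment)
print(loads)
```

Splitting the same eight models evenly by count would ignore how much each one costs to run. Balancing by forecast cost is what keeps one slow model from leaving the other workers idle.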