Paper Title
SECODA: Segmentation- and Combination-Based Detection of Anomalies
Authors
Abstract
This study introduces SECODA, a novel general-purpose, unsupervised, non-parametric anomaly detection algorithm for datasets containing continuous and categorical attributes. The method is guaranteed to identify cases with unique or sparse combinations of attribute values. Continuous attributes are discretized repeatedly to correctly determine the frequency of such value combinations. The concept of constellations, exponentially increasing weights and discretization cut points, and a pruning heuristic are used to detect anomalies with an optimal number of iterations. Moreover, the algorithm has a low memory footprint, and its runtime scales linearly with the size of the dataset. An evaluation with simulated and real-life datasets shows that the algorithm can identify many different types of anomalies, including complex multidimensional instances. An evaluation based on a data quality use case with a real dataset demonstrates that SECODA can bring relevant and practical value to real-world settings.
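The core idea of repeated discretization and constellation counting can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `secoda_like_scores` and its parameters `continuous_idx` and `iterations` are hypothetical, and equal-width binning, a doubling bin schedule, and an unweighted average are simplifying assumptions. The exponentially increasing weights and the pruning heuristic mentioned in the abstract are deliberately left out.

```python
from collections import Counter


def secoda_like_scores(rows, continuous_idx, iterations=4):
    """Return one score per row; a lower score means a rarer value combination.

    Simplified illustration only (assumed design, not the published algorithm):
    equal-width bins, bin count doubled on every pass, unweighted averaging.
    """
    # Range of each continuous attribute, used for equal-width binning.
    bounds = {}
    for j in continuous_idx:
        values = [row[j] for row in rows]
        bounds[j] = (min(values), max(values))

    totals = [0.0] * len(rows)
    bins = 2
    for _ in range(iterations):
        # Build one "constellation" key per row: binned continuous values
        # combined with the categorical values as they are.
        keys = []
        for row in rows:
            key = []
            for j, value in enumerate(row):
                if j in bounds:
                    lo, hi = bounds[j]
                    width = (hi - lo) / bins or 1.0  # guard constant columns
                    # Clamp so the maximum value falls into the last bin.
                    value = min(int((value - lo) / width), bins - 1)
                key.append(value)
            keys.append(tuple(key))

        # Frequency of each constellation in this pass.
        counts = Counter(keys)
        for i, key in enumerate(keys):
            totals[i] += counts[key]

        bins *= 2  # assumed schedule: finer discretization on each pass

    # Average constellation frequency over all passes.
    return [total / iterations for total in totals]


if __name__ == "__main__":
    data = [(1.0, "a"), (1.1, "a"), (0.9, "a"), (1.2, "a"), (1.0, "b"), (9.0, "a")]
    for row, score in zip(data, secoda_like_scores(data, continuous_idx=[0])):
        print(row, score)
```

In this toy dataset, the row with the unusual category and the row with the extreme numeric value each end up alone in their constellation, so they receive the lowest scores. The sketch makes one counting pass over the rows per iteration, which is consistent with the linear scaling the abstract describes, though it makes no claim about matching the published algorithm's behavior.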