论文标题

数据流中有效的子空间搜索

Efficient Subspace Search in Data Streams

论文作者

Fouché, Edouard, Kalinke, Florian, Böhm, Klemens

论文摘要

在现实世界中,数据流无处不在 - 考虑网络流量或传感器数据。采矿模式,例如离群值或集群必须从这些数据中实时进行。这是具有挑战性的,因为(1)流通常具有较高的维度,并且(2)数据特性可能会随着时间而变化。现有的方法倾向于仅关注一个方面,即高维度或流设置的细节。对于静态数据,一种处理高维度的常见方法(称为子空间搜索)提取了低维,“有趣”的预测(子空间),其中模式更易于找到。在本文中,我们通过将子空间搜索推广到数据流来解决挑战(1)和(2)。我们的方法,流式贪婪的最大随机偏差(SGMRD),在高维数据流中监视有趣的子空间。它利用基于匪徒理论的新型多元依赖性估计量和监测技术。我们表明SGMRD的好处是双重的:(i)有效监视子空间,(ii)这改善了下游数据挖掘任务的结果,例如离群检测。我们对合成和现实世界数据进行的实验表明,SGMRD优于其竞争对手的差距很大。

In the real world, data streams are ubiquitous -- think of network traffic or sensor data. Mining patterns, e.g., outliers or clusters, from such data must take place in real time. This is challenging because (1) streams often have high dimensionality, and (2) the data characteristics may change over time. Existing approaches tend to focus on only one aspect, either high dimensionality or the specifics of the streaming setting. For static data, a common approach to deal with high dimensionality -- known as subspace search -- extracts low-dimensional, `interesting' projections (subspaces), in which patterns are easier to find. In this paper, we address both Challenge (1) and (2) by generalising subspace search to data streams. Our approach, Streaming Greedy Maximum Random Deviation (SGMRD), monitors interesting subspaces in high-dimensional data streams. It leverages novel multivariate dependency estimators and monitoring techniques based on bandit theory. We show that the benefits of SGMRD are twofold: (i) It monitors subspaces efficiently, and (ii) this improves the results of downstream data mining tasks, such as outlier detection. Our experiments, performed against synthetic and real-world data, demonstrate that SGMRD outperforms its competitors by a large margin.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源