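Seasonal Averaged One-Dependence Estimators: A Novel Algorithm to Address Seasonal Concept Drift in High-Dimensional Stream Classification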

Paper Title

Seasonal Averaged One-Dependence Estimators: A Novel Algorithm to Address Seasonal Concept Drift in High-Dimensional Stream Classification

Authors

Rakshitha Godahewa, Trevor Yann, Christoph Bergmeir, Francois Petitjean

Abstract

Stream classification methods classify a continuous stream of data as new labelled samples arrive, and they often also have to deal with concept drift. This paper focuses on seasonal drift in stream classification, which can be found in many real-world application data sources. Traditional approaches to stream classification handle seasonal drift by including seasonal dummy/indicator variables or by building separate models for each season. However, these approaches have strong limitations in high-dimensional classification problems and with complex seasonal patterns. This paper explores how best to handle seasonal drift in the specific context of news article categorization (or classification/tagging), where seasonal drift is overwhelmingly the main type of drift present in the data and the data are high-dimensional. We introduce a novel classifier named Seasonal Averaged One-Dependence Estimators (SAODE), which extends the AODE classifier to handle seasonal drift by including time as a super parent. We assess our SAODE model using two large real-world text mining datasets, each comprising approximately a million records, against nine state-of-the-art stream and concept drift classification models, with and without seasonal indicators and with separate models built for each season. Across five different evaluation techniques, we show that our model consistently outperforms the other methods by a large margin, with statistically significant results.
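To make the "time as a super parent" idea concrete, here is a minimal Python sketch. It is not the authors' SAODE implementation: it is a simplified naive-Bayes-style model in which every word depends on both the class and the season, without the averaging over attribute super parents that AODE performs. The class name `SeasonSuperParentNB`, the smoothing constant `alpha`, and the toy data are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class SeasonSuperParentNB:
    """Toy incremental classifier: each word is conditioned on (class, season).

    Illustrative simplification only, not the SAODE algorithm from the paper.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                    # Laplace smoothing (assumed value)
        self.n = 0                            # number of samples seen so far
        self.labels = set()
        self.seasons = set()
        self.pair_counts = defaultdict(int)   # (label, season) -> count
        self.word_counts = defaultdict(int)   # (label, season, word) -> count

    def learn_one(self, words, season, label):
        """Update the counts with one labelled sample from the stream."""
        self.n += 1
        self.labels.add(label)
        self.seasons.add(season)
        self.pair_counts[(label, season)] += 1
        for word in set(words):
            self.word_counts[(label, season, word)] += 1

    def predict_one(self, words, season):
        """Return the label maximising log P(y, s) + sum_w log P(w | y, s)."""
        cells = len(self.labels) * len(self.seasons)
        best_label, best_score = None, -math.inf
        for label in self.labels:
            pair = self.pair_counts.get((label, season), 0)
            score = math.log((pair + self.alpha) / (self.n + self.alpha * cells))
            for word in set(words):
                hits = self.word_counts.get((label, season, word), 0)
                # Smoothed probability that the word is present given (y, s).
                score += math.log((hits + self.alpha) / (pair + 2 * self.alpha))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy stream: the same word signals a different class in each season.
model = SeasonSuperParentNB()
for _ in range(3):
    model.learn_one(["cup"], "summer", "sport")
    model.learn_one(["cup"], "winter", "food")

summer_label = model.predict_one(["cup"], "summer")
winter_label = model.predict_one(["cup"], "winter")
```

In this toy stream the season alone decides how the word "cup" is classified. A single model without a season variable would see the word split evenly between the two classes, which is the kind of seasonal drift the abstract describes.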
