Paper Title
Unsupervised Anomaly Detection for Auditing Data and Impact of Categorical Encodings
Paper Authors
Paper Abstract
In this paper, we introduce the Vehicle Claims dataset, consisting of fraudulent insurance claims for automotive repairs. The data belongs to the broader category of auditing data, which also includes journals and network intrusion data. Insurance claim data differ distinctly from other auditing data (such as network intrusion data) in their high number of categorical attributes. We tackle the common problem of missing benchmark datasets for anomaly detection: datasets are mostly confidential, and the public tabular datasets do not contain relevant and sufficient categorical attributes. Therefore, a large dataset is created for this purpose and referred to as the Vehicle Claims (VC) dataset. Shallow and deep learning methods are evaluated on the dataset. Because categorical attributes are introduced, we encounter the challenge of encoding them for a large dataset. As One-Hot encoding of a high-cardinality dataset invokes the "curse of dimensionality", we experiment with GEL encoding and an embedding layer for representing categorical attributes. Our work compares competitive learning, reconstruction-error, density estimation, and contrastive learning approaches, using Label, One-Hot, and GEL encoding and an embedding layer to handle categorical values.
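The dimensionality concern raised in the abstract can be illustrated with a minimal sketch. It assumes pandas is available and uses a hypothetical toy table; the column names `maker` and `color` and their values are invented for illustration and are not taken from the Vehicle Claims dataset. It only contrasts Label and One-Hot encoding and does not attempt to reproduce GEL encoding or the embedding layer.

```python
import pandas as pd

# Hypothetical toy data: column names and values are illustrative
# assumptions, not attributes of the Vehicle Claims dataset.
df = pd.DataFrame({
    "maker": ["Audi", "BMW", "Ford", "Audi", "Kia"],
    "color": ["red", "blue", "red", "black", "blue"],
})

# Label encoding: each attribute stays a single integer-coded column.
label_encoded = pd.DataFrame(
    {name: pd.factorize(df[name])[0] for name in df.columns}
)

# One-hot encoding: each distinct value becomes its own binary column,
# so the width grows with the cardinality of every attribute.
one_hot_encoded = pd.get_dummies(df)

print("label encoding columns:  ", label_encoded.shape[1])
print("one-hot encoding columns:", one_hot_encoded.shape[1])
```

With two attributes holding four and three distinct values, Label encoding keeps two columns while One-Hot encoding produces seven. On attributes with thousands of distinct values the same growth becomes the "curse of dimensionality" mentioned above.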