Synthesizing Property & Casualty Ratemaking Datasets using Generative Adversarial Networks

Paper Title

Synthesizing Property & Casualty Ratemaking Datasets using Generative Adversarial Networks

Authors

Cote, Marie-Pier; Hartman, Brian; Mercier, Olivier; Meyers, Joshua; Cummings, Jared; Harmon, Elijah

Abstract

Due to confidentiality issues, it can be difficult to access or share interesting datasets for methodological development in actuarial science, or in other fields where personal data are important. We show how to design three different types of generative adversarial networks (GANs) that can build a synthetic insurance dataset from a confidential original dataset. The goal is to obtain synthetic data that no longer contain sensitive information but still have the same structure as the original dataset and retain its multivariate relationships. To adequately model the specific characteristics of insurance data, we use GAN architectures adapted for multi-categorical data: a Wasserstein GAN with gradient penalty (MC-WGAN-GP), a conditional tabular GAN (CTGAN), and a Mixed Numerical and Categorical Differentially Private GAN (MNCDP-GAN). For transparency, the approaches are illustrated using a public dataset, the French motor third-party liability data. We compare the three GANs on several aspects: ability to reproduce the original data structure and predictive models, privacy, and ease of use. We find that the MC-WGAN-GP synthesizes the best data, the CTGAN is the easiest to use, and the MNCDP-GAN guarantees differential privacy.

Paper Title