Title

Generative Trees: Adversarial and Copycat

Authors

Richard Nock, Mathieu Guillame-Bert

Abstract

While Generative Adversarial Networks (GANs) achieve spectacular results on unstructured data like images, there is still a gap on tabular data, data for which state of the art supervised learning still favours to a large extent decision tree (DT)-based models. This paper proposes a new path forward for the generation of tabular data, exploiting decades-old understanding of the supervised task's best components for DT induction, from losses (properness), models (tree-based) to algorithms (boosting). The \textit{properness} condition on the supervised loss -- which postulates the optimality of Bayes rule -- leads us to a variational GAN-style loss formulation which is \textit{tight} when discriminators meet a calibration property trivially satisfied by DTs, and, under common assumptions about the supervised loss, yields "one loss to train against them all" for the generator: the $χ^2$. We then introduce tree-based generative models, \textit{generative trees} (GTs), meant to mirror on the generative side the good properties of DTs for classifying tabular data, with a boosting-compliant \textit{adversarial} training algorithm for GTs. We also introduce \textit{copycat training}, in which the generator copies at run time the underlying tree (graph) of the discriminator DT and completes it for the hardest discriminative task, with boosting compliant convergence. We test our algorithms on tasks including fake/real distinction, training from fake data and missing data imputation. Each one of these tasks displays that GTs can provide comparatively simple -- and interpretable -- contenders to sophisticated state of the art methods for data generation (using neural network models) or missing data imputation (relying on multiple imputation by chained equations with complex tree-based modeling).
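For reference, the $\chi^2$ the abstract names as the generator's "one loss to train against them all" is the standard chi-squared divergence between the real distribution $P$ and the generated distribution $Q$ (standard definition given here for context; the paper's exact variational formulation is not reproduced):

```latex
\chi^2(P \,\|\, Q) \;=\; \int \left(\frac{\mathrm{d}P}{\mathrm{d}Q} - 1\right)^{2} \mathrm{d}Q
```

It is an $f$-divergence, which is what makes a GAN-style variational (discriminator-based) lower bound available.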
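To make the notion of a tree-based generative model concrete, here is a minimal sketch of sampling from a generative tree: internal nodes branch stochastically and leaves define regions of attribute space to sample from. All class names, the uniform-in-leaf sampling rule, and the tree layout are illustrative assumptions, not the paper's exact construction:

```python
import random

class Node:
    """One node of a toy generative tree (illustrative, not the paper's model)."""
    def __init__(self, feature=None, threshold=None, p_left=None,
                 left=None, right=None, box=None):
        self.feature = feature      # attribute index the node splits on
        self.threshold = threshold  # split value for that attribute
        self.p_left = p_left        # probability mass routed to the left child
        self.left, self.right = left, right
        self.box = box              # leaf only: list of (low, high) per attribute

    def sample(self, rng):
        node = self
        while node.box is None:     # descend stochastically to a leaf
            node = node.left if rng.random() < node.p_left else node.right
        # draw each attribute uniformly inside the leaf's box
        return [rng.uniform(lo, hi) for lo, hi in node.box]

rng = random.Random(0)
# Tiny tree over 2 attributes: split on attribute 0 at 0.5, with 70% of the
# generated mass falling in the left region.
gt = Node(feature=0, threshold=0.5, p_left=0.7,
          left=Node(box=[(0.0, 0.5), (0.0, 1.0)]),
          right=Node(box=[(0.5, 1.0), (0.0, 1.0)]))
rows = [gt.sample(rng) for _ in range(5)]
```

Because the generator is itself a tree graph, copycat training as described in the abstract amounts to copying the discriminator DT's splits into such a structure and fitting the leaf-level generation, which is what keeps the model interpretable.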
