Paper title
Diffusion models for missing value imputation in tabular data
Paper authors
Paper abstract
Missing value imputation in machine learning is the task of accurately estimating the missing values in a dataset using the available information. Several deep generative modeling methods have been proposed for this task and have demonstrated their usefulness, e.g., generative adversarial imputation networks. Recently, diffusion models have gained popularity because of their effectiveness in generative modeling tasks for images, text, audio, etc. To our knowledge, comparatively little attention has been paid to investigating the effectiveness of diffusion models for missing value imputation in tabular data. Building on the recent development of diffusion models for time-series data imputation, we propose a diffusion model approach called "Conditional Score-based Diffusion Models for Tabular data" (TabCSDI). To handle categorical and numerical variables simultaneously, we investigate three techniques: one-hot encoding, analog bits encoding, and feature tokenization. Experimental results on benchmark datasets demonstrate the effectiveness of TabCSDI compared with well-known existing methods and also emphasize the importance of the categorical embedding techniques.
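To make two of the categorical encodings named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each category has already been mapped to an integer code in `[0, num_categories)`, and it assumes the common analog bits convention of writing the code in binary and rescaling the bits to the real values -1 and +1. The function names are hypothetical. Feature tokenization is omitted because it relies on learned embeddings.

```python
import numpy as np


def one_hot_encode(codes, num_categories):
    """Map integer category codes to one-hot rows of length num_categories."""
    codes = np.asarray(codes)
    return np.eye(num_categories, dtype=np.float32)[codes]


def analog_bits_encode(codes, num_categories):
    """Write each code in binary (most significant bit first), scaled to {-1, +1}."""
    codes = np.asarray(codes)
    # Number of bits needed to represent the largest code (at least one bit).
    num_bits = max(1, (num_categories - 1).bit_length())
    shifts = np.arange(num_bits - 1, -1, -1)
    bits = (codes[:, None] >> shifts) & 1
    return bits.astype(np.float32) * 2.0 - 1.0


def analog_bits_decode(values):
    """Threshold real-valued bits at zero and rebuild the integer codes."""
    bits = (np.asarray(values) > 0).astype(np.int64)
    weights = 1 << np.arange(bits.shape[1] - 1, -1, -1)
    return bits @ weights


if __name__ == "__main__":
    codes = [0, 3, 4]  # three rows of a categorical column with 5 categories
    print(one_hot_encode(codes, 5))
    print(analog_bits_encode(codes, 5))
```

Under these assumptions, a column with 5 categories takes 5 real-valued dimensions with one-hot encoding and 3 with analog bits. After imputation, continuous outputs can be mapped back to categories with `argmax` for one-hot vectors or with `analog_bits_decode` for analog bits.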