Paper title
Semiparametric Gaussian Copula Regression modeling for Mixed Data Types (SGCRM)
Paper authors
Paper abstract
Many clinical and epidemiological studies encode participant-level information as a collection of continuous, truncated, ordinal, and binary variables. To gain new insight into the complex interactions among these variables, there is a critical need for flexible frameworks for the joint modeling of mixed-data-type variables. We propose Semiparametric Gaussian Copula Regression modeling (SGCRM), which models the joint dependence structure among observed continuous, truncated, ordinal, and binary variables and constructs conditional models with any of these four data types as outcomes, with a guarantee that the derived conditional models are mutually consistent. The Semiparametric Gaussian Copula (SGC) mechanism assumes that the observed SGC variables are generated by (i) monotonically transforming the marginals of a latent multivariate normal random variable and (ii) dichotomizing or truncating these transformed marginals. SGCRM estimates the correlation matrix of the latent normal variables by inverting the "bridges" between the Kendall's tau rank correlations of the observed mixed-data-type variables and the latent Gaussian correlations. We derive a novel bridging result to handle a general ordinal variable. In addition to the previously established asymptotic consistency, we establish asymptotic normality of the latent correlation estimators. We also establish the asymptotic normality of the SGCRM regression estimators and provide a computationally efficient way to calculate their asymptotic covariances. We propose computationally efficient methods for predicting SGC latent variables and for imputing missing data. Using data from the National Health and Nutrition Examination Survey (NHANES), we illustrate SGCRM and compare it with traditional conditional regression models, including truncated Gaussian regression, ordinal probit, and probit models.
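The bridge inversion mentioned in the abstract can be illustrated with a minimal sketch for the simplest case only, a pair of continuous variables. It relies on the standard Gaussian copula identity tau = (2/pi) * arcsin(r), so that r = sin(pi * tau / 2). The bridges for truncated, ordinal, and binary variables are not reproduced here, and the function names (`kendall_tau`, `latent_corr_from_tau`, `simulate`) and the simulation settings are assumptions made for illustration, not part of the paper.

```python
import math
import random


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return score / (n * (n - 1) / 2)


def latent_corr_from_tau(tau):
    """Invert the continuous-continuous bridge tau = (2/pi) * arcsin(r)."""
    return math.sin(math.pi * tau / 2)


def simulate(n, rho, seed=0):
    """Hypothetical data: latent bivariate normal with correlation rho,
    observed only through monotone transformations of each margin."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        xs.append(math.exp(z1))  # monotone transform of the first margin
        ys.append(z2 ** 3)       # monotone transform of the second margin
    return xs, ys


x, y = simulate(400, 0.6)
r_hat = latent_corr_from_tau(kendall_tau(x, y))
print(round(r_hat, 2))  # estimate of the latent correlation (true value 0.6)
```

Because Kendall's tau depends only on ranks, the unknown monotone transformations never need to be estimated, which is what makes the rank-based route attractive in the semiparametric setting.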