Paper title
Inference after latent variable estimation for single-cell RNA sequencing data
Paper authors
Paper abstract
In the analysis of single-cell RNA sequencing data, researchers often characterize the variation between cells by estimating a latent variable, such as cell type or pseudotime, that represents some aspect of the individual cell's state. They then test each gene for association with the estimated latent variable. If the same data are used for both of these steps, then standard methods for computing p-values and confidence intervals in the second step will fail to achieve statistical guarantees such as Type 1 error control. Furthermore, approaches such as sample splitting, which can be applied to solve similar problems in other settings, are not applicable in this context. In this paper, we introduce count splitting, a flexible framework that allows us to carry out valid inference in this setting, for virtually any latent variable estimation technique and inference approach, under a Poisson assumption. We demonstrate the Type 1 error control and power of count splitting in a simulation study, and apply count splitting to a dataset of pluripotent stem cells differentiating to cardiomyocytes.
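The abstract does not spell out how the counts are split, so the following is only a minimal sketch under a stated assumption: that the split is done by binomial thinning. It relies on a standard property of the Poisson distribution: if X ~ Poisson(λ) and X_train given X is Binomial(X, ε), then X_train and X − X_train are independent Poisson variables with means ελ and (1 − ε)λ. The names `count_split`, `eps`, and `seed` are hypothetical and chosen for illustration; they are not taken from the paper or from any package.

```python
import random


def count_split(counts, eps=0.5, seed=0):
    """Split a cells-by-genes count matrix into train and test parts.

    Assumption (not stated in the abstract): the split is binomial thinning,
    so each observed count is divided at random with probability eps.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    rng = random.Random(seed)
    train, test = [], []
    for row in counts:
        # Each of the x unit counts goes to the training part with
        # probability eps, which is a Binomial(x, eps) draw.
        train_row = [sum(rng.random() < eps for _ in range(x)) for x in row]
        # The remainder forms the test part, so train + test == original.
        test_row = [x - t for x, t in zip(row, train_row)]
        train.append(train_row)
        test.append(test_row)
    return train, test


# Toy matrix: 2 cells (rows) by 3 genes (columns); values are made up.
counts = [[5, 0, 12], [3, 7, 1]]
train, test = count_split(counts, eps=0.5, seed=1)
```

Under this assumption, the latent variable (for example, clusters or pseudotime) would be estimated from `train` only, and each gene would then be tested for association with that estimate using `test` only. Every cell appears in both matrices, which is what distinguishes this from splitting the cells themselves.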