Paper Title
Unsupervised language models for disease variant prediction
Paper Authors
Paper Abstract
There is considerable interest in predicting the pathogenicity of protein variants in human genes. Due to the sparsity of high-quality labels, recent approaches turn to \textit{unsupervised} learning, using Multiple Sequence Alignments (MSAs) to train generative models of natural sequence variation within each gene. These generative models then predict variant likelihood as a proxy for evolutionary fitness. In this work, we instead combine this evolutionary principle with pretrained protein language models (LMs), which have already shown promising results in predicting protein structure and function. Rather than training a separate model per gene, we find that a single protein LM trained on broad sequence datasets can score pathogenicity for any gene variant zero-shot, without MSAs or finetuning. We call this unsupervised approach \textbf{VELM} (Variant Effect via Language Models), and show that it achieves scoring performance comparable to the state of the art when evaluated on clinically labeled variants of disease-related genes.
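The scoring recipe the abstract describes, querying a pretrained protein LM for the relative likelihood of a substitution, can be illustrated with a short sketch. The example below assumes the publicly available fair-esm package and its ESM-1b checkpoint; the specific model, checkpoint, and scoring rule used by VELM may differ, and variant_score is a hypothetical helper written for this illustration only.

```python
# Minimal sketch of zero-shot variant-effect scoring with a pretrained
# protein language model. Assumes the `fair-esm` package
# (pip install fair-esm); this shows the masked-marginal idea, not
# necessarily the exact scoring rule of VELM.
import torch
import esm

# Load a pretrained protein LM and its alphabet (tokenizer).
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

def variant_score(sequence: str, pos: int, wt: str, mut: str) -> float:
    """Log-likelihood ratio log p(mut) - log p(wt) at a masked position.

    `pos` is 0-indexed into `sequence`. More negative scores indicate the
    variant is less likely under the LM, a proxy for lower evolutionary
    fitness and hence potential pathogenicity.
    """
    assert sequence[pos] == wt, "wild-type residue mismatch"
    _, _, tokens = batch_converter([("protein", sequence)])
    # Offset by 1 because the alphabet prepends a BOS token.
    tokens[0, pos + 1] = alphabet.mask_idx
    with torch.no_grad():
        logits = model(tokens)["logits"]
    log_probs = torch.log_softmax(logits[0, pos + 1], dim=-1)
    return (log_probs[alphabet.get_idx(mut)]
            - log_probs[alphabet.get_idx(wt)]).item()

# Example: score a hypothetical A4G substitution in a toy sequence.
print(variant_score("MKTAYIAKQR", 3, "A", "G"))
```

Because the model is queried once per masked position with no gene-specific training data, the same checkpoint scores variants in any gene, which is what makes the approach zero-shot and MSA-free.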