Paper title
Widely Used and Fast De Novo Drug Design by a Protein Sequence-Based Reinforcement Learning Model
Paper authors
Paper abstract
De novo molecular design has facilitated the exploration of large chemical spaces to accelerate drug discovery. Structure-based de novo methods can overcome the scarcity of active-ligand data by incorporating drug-target interactions into deep generative architectures. However, these strategies are bottlenecked by the small fraction of proteins and complexes with experimentally determined structures. In addition, molecular generation is computationally expensive because of the 3D representations of both molecule and protein. Here, we demonstrate a widely used and fast protein sequence-based reinforcement learning (RL) model for drug discovery. In the generative model, one of the reward components, a binding affinity predictor, is based on the 1D protein sequence and the molecular SMILES string. As a proof of concept, the RL model was used to design molecules for four targets. The generated compounds showed bioactivity when validated by both QSAR and molecular docking against experimental 3D binding pockets. We also found that the performance of the generated molecules depends on the choice of training data source for the binding predictor. Furthermore, our model was applied to drug design for CDK20, a kinase without any experimental structure. With only the 1D protein sequence as input, the generated novel compounds showed favorable binding affinity based on the AlphaFold-predicted structure.
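The abstract describes a reward made of several components, one of which scores a molecule from its SMILES string and the target's 1D protein sequence. The following is only a minimal sketch of how such a reward might be assembled, not the authors' implementation. The weighted geometric mean is an assumed aggregation rule, and `toy_affinity`, `toy_size_preference`, and `combined_reward` are hypothetical names with placeholder logic standing in for trained models.

```python
import math
from typing import Callable, Dict, Tuple

# A scoring component maps (SMILES, protein sequence) to a score in [0, 1].
Component = Callable[[str, str], float]


def toy_affinity(smiles: str, sequence: str) -> float:
    """Hypothetical placeholder for a learned binding affinity predictor.

    A real predictor would be a trained model over the two 1D inputs;
    this stand-in only derives a made-up value in (0, 1) from their lengths.
    """
    return len(smiles) / (len(smiles) + len(sequence) / 10 + 1)


def toy_size_preference(smiles: str, sequence: str) -> float:
    """Hypothetical second component: prefers SMILES of about 40 characters."""
    return math.exp(-((len(smiles) - 40) / 20) ** 2)


def combined_reward(
    smiles: str,
    sequence: str,
    components: Dict[str, Tuple[Component, float]],
) -> float:
    """Weighted geometric mean of component scores (an assumed aggregation)."""
    total_weight = sum(weight for _, weight in components.values())
    log_sum = 0.0
    for score_fn, weight in components.values():
        # Clamp so that a zero score cannot produce log(0).
        score = min(max(score_fn(smiles, sequence), 1e-6), 1.0)
        log_sum += weight * math.log(score)
    return math.exp(log_sum / total_weight)


# Illustrative inputs only: an arbitrary SMILES string and a made-up sequence.
components = {
    "affinity": (toy_affinity, 2.0),
    "size": (toy_size_preference, 1.0),
}
reward = combined_reward("CC(=O)Oc1ccccc1C(=O)O", "MKTAYIAKQR" * 30, components)
```

Because the affinity component in this sketch needs only two strings, no 3D structure of the target has to be available at generation time, which is the property the abstract relies on for CDK20.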