Paper Title

ReLSO: A Transformer-based Model for Latent Space Optimization and Generation of Proteins

Paper Authors

Egbert Castro, Abhinav Godavarthi, Julian Rubinfien, Kevin B. Givechian, Dhananjay Bhaskar, Smita Krishnaswamy

Paper Abstract

The development of powerful natural language models has increased the ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution, and next-generation sequencing have allowed for the accumulation of large amounts of labeled fitness data. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder that features a highly structured latent space and is trained to jointly generate sequences and predict fitness. Through regularized prediction heads, ReLSO introduces a powerful protein sequence encoder and a novel approach for efficient fitness-landscape traversal. Using ReLSO, we explicitly model the sequence-function landscape of large labeled datasets and generate new molecules by optimizing within the latent space using gradient-based methods. We evaluate this approach on several publicly available protein datasets, including variant sets of anti-ranibizumab and GFP. We observe greater sequence-optimization efficiency (increase in fitness per optimization step) with ReLSO compared to other approaches, and ReLSO more robustly generates high-fitness sequences. Furthermore, the attention-based relationships learned by the jointly trained ReLSO models provide a potential avenue toward sequence-level fitness attribution.
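
The abstract combines two coupled ideas: a transformer autoencoder whose latent code feeds a regularized fitness-prediction head, and gradient-based optimization of that latent code to propose higher-fitness sequences. Below is a minimal PyTorch sketch of both ideas, not the authors' implementation; all names, dimensions, and hyperparameters (e.g., `JointAutoencoder`, `LATENT_DIM`, `optimize_in_latent`) are illustrative assumptions.

```python
# A minimal sketch (assumed, not the ReLSO reference code) of a transformer
# autoencoder with a fitness head, plus gradient ascent in latent space.
import torch
import torch.nn as nn

VOCAB_SIZE = 21   # 20 amino acids + padding token (assumed)
MAX_LEN = 128     # maximum sequence length (assumed)
LATENT_DIM = 64   # latent bottleneck size (assumed)

class JointAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, 128)
        enc_layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.to_latent = nn.Linear(128, LATENT_DIM)                   # pooled encoding -> z
        self.decoder = nn.Linear(LATENT_DIM, MAX_LEN * VOCAB_SIZE)    # toy decoder head
        self.fitness_head = nn.Sequential(                            # prediction head on z
            nn.Linear(LATENT_DIM, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def encode(self, tokens):
        h = self.encoder(self.embed(tokens))          # (B, L, 128)
        return self.to_latent(h.mean(dim=1))          # mean-pool to (B, LATENT_DIM)

    def forward(self, tokens):
        z = self.encode(tokens)
        logits = self.decoder(z).view(-1, MAX_LEN, VOCAB_SIZE)  # reconstruction logits
        return logits, self.fitness_head(z).squeeze(-1)         # predicted fitness

def optimize_in_latent(model, z_init, steps=50, lr=0.1):
    """Gradient ascent on predicted fitness within the latent space."""
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-model.fitness_head(z)).sum().backward()     # negate to maximize fitness
        opt.step()
    return z.detach()                                  # decode to propose new sequences

# Smoke test on toy data
model = JointAutoencoder()
tokens = torch.randint(0, VOCAB_SIZE, (4, MAX_LEN))    # a batch of 4 random sequences
logits, fitness = model(tokens)
z_opt = optimize_in_latent(model, model.encode(tokens))
```

In a joint training loop, a reconstruction loss (cross-entropy over the logits) would be summed with a fitness-regression loss (e.g., MSE against labels); per the abstract, it is this joint, regularized objective that structures the latent space so that gradient steps on predicted fitness correspond to meaningful sequence changes.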
