Paper Title
Fixed-Length Protein Embeddings using Contextual Lenses
Paper Authors
Paper Abstract
The Basic Local Alignment Search Tool (BLAST) is currently the most popular method for searching databases of biological sequences. BLAST compares sequences via similarity defined by a weighted edit distance, which makes it computationally expensive. In contrast to edit distance, a vector similarity approach can be accelerated substantially using modern hardware or hashing techniques. Such an approach requires fixed-length embeddings for biological sequences. There has been recent interest in learning fixed-length protein embeddings using deep learning models, under the hypothesis that the hidden layers of supervised or semi-supervised models can produce potentially useful vector embeddings. We consider transformer (BERT) protein language models that are pretrained on the TrEMBL data set and learn fixed-length embeddings on top of them with contextual lenses. The embeddings are trained to predict the family a protein belongs to for sequences in the Pfam database. We show that for nearest-neighbor family classification, pretraining offers a noticeable boost in performance and that the corresponding learned embeddings are competitive with BLAST. Furthermore, we show that the raw transformer embeddings, obtained via static pooling, do not perform well on nearest-neighbor family classification, which suggests that learning embeddings in a supervised manner via contextual lenses may be a compute-efficient alternative to fine-tuning.
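To make the pipeline in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes per-residue features from a frozen language model (stubbed here with a random projection of one-hot amino acids), a linear-then-max-pool style contextual lens as one plausible instantiation, and 1-nearest-neighbor family classification by cosine similarity over the fixed-length embeddings. All function names, dimensions, and the toy reference set are illustrative assumptions; in the paper the lens parameters are learned via a Pfam family-prediction head on top of the frozen transformer, and the training loop is omitted here.

```python
# Sketch only: frozen encoder is a stand-in, lens parameters are untrained.
import jax
import jax.numpy as jnp

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(seq):
    """Sequence string of length L -> [L, 20] one-hot matrix."""
    idx = jnp.array([AA_INDEX[aa] for aa in seq])
    return jax.nn.one_hot(idx, len(AMINO_ACIDS))


def frozen_encoder(x, params):
    """Stand-in for a frozen transformer: per-residue features [L, d_model]."""
    return jnp.tanh(x @ params["W_enc"])


def contextual_lens(h, params):
    """Linear max-pool lens: dense layer + ReLU, then max over residues.

    Maps variable-length [L, d_model] features to a fixed-length vector [d_lens].
    """
    z = jax.nn.relu(h @ params["W_lens"] + params["b_lens"])
    return jnp.max(z, axis=0)


def static_mean_pool(h):
    """Static-pooling baseline: parameter-free average over residues."""
    return jnp.mean(h, axis=0)


def embed(seq, params):
    h = frozen_encoder(one_hot_encode(seq), params)
    return contextual_lens(h, params)


def nearest_neighbor_family(query_emb, ref_embs, ref_labels):
    """1-NN family classification by cosine similarity."""
    q = query_emb / jnp.linalg.norm(query_emb)
    r = ref_embs / jnp.linalg.norm(ref_embs, axis=1, keepdims=True)
    return ref_labels[int(jnp.argmax(r @ q))]


if __name__ == "__main__":
    key = jax.random.PRNGKey(0)
    k1, k2 = jax.random.split(key)
    d_model, d_lens = 64, 32  # illustrative sizes, not the paper's
    params = {
        "W_enc": jax.random.normal(k1, (len(AMINO_ACIDS), d_model)) * 0.1,
        "W_lens": jax.random.normal(k2, (d_model, d_lens)) * 0.1,
        "b_lens": jnp.zeros(d_lens),
    }
    # Toy "database" of reference embeddings with family labels.
    refs = [("MKTAYIAKQR", "fam_A"), ("GAVLIPFMWY", "fam_B")]
    ref_embs = jnp.stack([embed(s, params) for s, _ in refs])
    ref_labels = [label for _, label in refs]
    print(nearest_neighbor_family(embed("MKTAYIAKQL", params), ref_embs, ref_labels))
```

Swapping `contextual_lens` for `static_mean_pool` in `embed` reproduces the kind of static-pooling baseline the abstract reports as performing poorly on nearest-neighbor family classification; the vector similarity search in `nearest_neighbor_family` is the step that, at scale, can be accelerated with specialized hardware or hashing in place of BLAST's edit-distance comparisons.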