Title
Linguistically inspired roadmap for building biologically reliable protein language models
Authors
Abstract
Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, being largely black-box models and thus challenging to interpret, current protein LM approaches do not contribute to a fundamental understanding of sequence-function mappings, hindering rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in analytical rule extraction from natural language data, can aid with building more interpretable protein LMs that are more likely to learn relevant domain-specific rules. Differences between protein sequence data and linguistic sequence data require the integration of more domain-specific knowledge in protein LMs compared to natural language LMs. Here, we provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation. Incorporating linguistic ideas into protein LMs enables the development of next-generation interpretable machine-learning models with the potential of uncovering the biological mechanisms underlying sequence-function relationships.