Paper title
Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing
Paper authors
Paper abstract
There is increasing adoption of artificial intelligence in drug discovery. However, existing studies use machine learning to mainly utilize the chemical structures of molecules but ignore the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions and predict complex biological activities. Here we present a multi-modal molecule structure-text model, MoleculeSTM, by jointly learning molecules' chemical structures and textual descriptions via a contrastive learning strategy. To train MoleculeSTM, we construct a large multi-modal dataset, namely, PubChemSTM, with over 280,000 chemical structure-text pairs. To demonstrate the effectiveness and utility of MoleculeSTM, we design two challenging zero-shot tasks based on text instructions, including structure-text retrieval and molecule editing. MoleculeSTM has two main properties: open vocabulary and compositionality via natural language. In experiments, MoleculeSTM obtains the state-of-the-art generalization ability to novel biochemical concepts across various benchmarks.
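The abstract names a contrastive learning strategy and zero-shot structure-text retrieval but does not specify either. As a rough illustration only, the sketch below shows one common form of contrastive objective (a symmetric InfoNCE-style loss over cosine similarities) and nearest-neighbour retrieval in a shared embedding space. The function names, the temperature value, and the toy two-dimensional embeddings are assumptions made for this example; they are not taken from the paper, and the real model would produce the embeddings with learned structure and text encoders.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (assumes the vector is non-zero).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(struct_embs, text_embs):
    # Cosine similarity between every structure embedding and every text embedding.
    s = [l2_normalize(v) for v in struct_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(si, tj)) for tj in t] for si in s]

def cross_entropy_diagonal(logits):
    # Mean cross-entropy where the correct class for row i is column i,
    # i.e. the i-th structure is paired with the i-th text.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def contrastive_loss(struct_embs, text_embs, temperature=0.1):
    # Symmetric loss: structure-to-text plus text-to-structure, averaged.
    # The temperature of 0.1 is an illustrative assumption.
    sim = similarity_matrix(struct_embs, text_embs)
    logits = [[x / temperature for x in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

def retrieve(struct_emb, candidate_text_embs):
    # Zero-shot retrieval: index of the candidate text most similar to the structure.
    sims = similarity_matrix([struct_emb], candidate_text_embs)[0]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy embeddings standing in for encoder outputs (hypothetical values).
structures = [[1.0, 0.0], [0.0, 1.0]]
texts_aligned = [[0.9, 0.1], [0.2, 0.8]]
texts_shuffled = list(reversed(texts_aligned))

loss_aligned = contrastive_loss(structures, texts_aligned)
loss_shuffled = contrastive_loss(structures, texts_shuffled)
best_for_first = retrieve(structures[0], texts_aligned)
```

In this toy setup the loss is lower when each structure sits next to its own description than when the descriptions are shuffled, which is the property a contrastive objective of this kind rewards; retrieval then reduces to picking the most similar candidate text.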