Paper Title
A Benchmarking Dataset with 2440 Organic Molecules for Volume Distribution at Steady State
Authors
Abstract
Background: The volume of distribution at steady state (VDss) is a fundamental pharmacokinetic (PK) property of drugs that measures how extensively a drug distributes throughout the body. Together with the clearance (CL), it determines the half-life and, therefore, the dosing interval. However, the limited size of the available molecular datasets restricts the generalizability of reported machine learning models. Objective: This study aims to provide a clean and comprehensive dataset for human VDss as a benchmarking data source that supports future predictive studies. In addition, several predictive models were built with machine learning regression algorithms. Methods: The dataset was curated from 13 publicly accessible data sources and the DrugBank database, drawing exclusively on intravenous administration data, and then underwent extensive data cleaning. Molecular descriptors were calculated with Mordred, and feature selection was performed before constructing the predictive models. Five machine learning methods were used to build regression models, grid search was used to optimize hyperparameters, and ten-fold cross-validation was used to evaluate the models. Results: An enriched dataset of VDss (https://github.com/da-wen-er/VDss) was constructed with 2440 molecules. Among the prediction models, the LightGBM model was the most stable and had the best internal predictive ability, with Q2 = 0.837 and R2 = 0.814; for the other four models, Q2 was higher than 0.79. Conclusions: To the best of our knowledge, this is the largest dataset for VDss, and it can serve as a benchmark for computational studies of VDss. Moreover, the regression models reported in this study can be of use for pharmacokinetics-related studies.
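To make the evaluation metric in the abstract concrete, the following is a minimal sketch of how a cross-validated Q2 is commonly computed: Q2 = 1 - PRESS / TSS, where PRESS is the sum of squared errors of out-of-fold predictions and TSS is the total sum of squares around the mean. The abstract does not spell out its formula, so this definition is an assumption. The one-variable least-squares model is only a stand-in for the study's regressors (it is not LightGBM), the fold assignment by index is a simplification, and the data are synthetic toy values, not taken from the dataset.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (stand-in for a real regressor)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def cross_validated_q2(xs, ys, k=10):
    """Return Q2 = 1 - PRESS / TSS using out-of-fold predictions from k-fold CV."""
    n = len(xs)
    preds = [0.0] * n
    for fold in range(k):
        # Simplified fold assignment: sample i belongs to fold i % k.
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in test_idx:
            preds[i] = a * xs[i] + b  # prediction made without seeing sample i
    y_bar = mean(ys)
    press = sum((y - p) ** 2 for y, p in zip(ys, preds))
    tss = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - press / tss


# Toy data only: a hypothetical descriptor x and a response y with small alternating noise.
xs = [float(i) for i in range(50)]
ys = [0.5 * x + 1.0 + (0.3 if i % 2 else -0.3) for i, x in enumerate(xs)]
q2 = cross_validated_q2(xs, ys, k=10)
print(f"ten-fold cross-validated Q2 = {q2:.3f}")
```

Because every prediction comes from a model that never saw the corresponding sample, Q2 is typically lower than the R2 obtained by fitting and scoring on the same data; a real pipeline would also shuffle the samples before assigning folds.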