Paper Title

Dive into Machine Learning Algorithms for Influenza Virus Host Prediction with Hemagglutinin Sequences

Authors

Yanhua Xu, Dominik Wojtczak

Abstract

Influenza viruses mutate rapidly and can pose a threat to public health, especially to people in vulnerable groups. Throughout history, influenza A viruses have caused pandemics across different species. Identifying the origin of a virus is important for preventing the spread of an outbreak. Recently, there has been increasing interest in using machine learning algorithms to provide fast and accurate predictions for viral sequences. In this study, real testing data sets and a variety of evaluation metrics were used to evaluate machine learning algorithms at different taxonomic levels. Because hemagglutinin is the major protein in the immune response, only hemagglutinin sequences were used, represented by position-specific scoring matrices and word embeddings. The results suggest that the 5-grams-transformer neural network is the most effective algorithm for predicting viral sequence origins, with approximately 99.54% AUCPR, 98.01% F1 score and 96.60% MCC at the higher classification level, and approximately 94.74% AUCPR, 87.41% F1 score and 80.79% MCC at the lower classification level.
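
To make two terms from the abstract concrete, here is a minimal sketch of how a protein sequence might be split into 5-grams and how F1 and MCC are computed from confusion-matrix counts. It is an illustration under stated assumptions, not the authors' pipeline. The abstract does not say whether the 5-grams overlap, so a stride of 1 is assumed. The metrics are shown for the binary case only, and the function names, the toy sequence, and the counts are all hypothetical.

```python
import math


def overlapping_kmers(sequence, k=5):
    """Split a sequence into overlapping k-grams (stride 1 is an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def f1_and_mcc(tp, fp, fn, tn):
    """Return (F1, MCC) for binary confusion-matrix counts."""
    f1_den = 2 * tp + fp + fn
    f1 = 2 * tp / f1_den if f1_den else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return f1, mcc


# Hypothetical toy inputs; these are not data or results from the paper.
tokens = overlapping_kmers("MKTIIALSYIF")
f1, mcc = f1_and_mcc(tp=90, fp=10, fn=5, tn=95)
print(tokens[:3], round(f1, 4), round(mcc, 4))
```

MCC uses all four cells of the confusion matrix, which is one reason it is often reported alongside F1 when classes are imbalanced.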
