Paper title
Efficient Approximate Kernel Based Spike Sequence Classification
Authors
Abstract
Machine learning (ML) models such as SVMs, when used for tasks like classification and clustering of sequences, require a definition of distance or similarity between pairs of sequences. Several methods have been proposed to compute this similarity, including an exact approach that counts the number of matches between $k$-mers (sub-sequences of length $k$) and an approximate approach that estimates pairwise similarity scores. Although exact methods yield better classification performance, their high computational cost limits their applicability to a small number of sequences. The approximate algorithms have been shown to be more scalable and to perform comparably to (sometimes better than) the exact methods. They are designed in a "general" way to deal with different types of sequences (e.g., music, protein). Although general applicability is a desirable property of an algorithm, it is not what every scenario calls for. For example, in the current COVID-19 (coronavirus) pandemic, there is a need for an approach that can deal specifically with the coronavirus. To this end, we propose a series of ways to improve the approximate kernel (using minimizers and information gain) in order to enhance its predictive performance on coronavirus sequences. More specifically, we improve the quality of the approximate kernel using domain knowledge (computed using information gain) and efficient preprocessing (using minimizer computation) to classify coronavirus spike protein sequences corresponding to different variants (e.g., Alpha, Beta, Gamma). We report results using different classification and clustering algorithms and evaluate their performance using multiple evaluation metrics. Using two datasets, we show that our proposed method helps improve the kernel's performance compared to the baseline and state-of-the-art approaches in the healthcare domain.
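To make the terms in the abstract concrete, the following is a minimal Python sketch of $k$-mer extraction, a simple exact match-counting similarity, and minimizer preprocessing. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the exact similarity is taken to be the dot product of $k$-mer count vectors (one common reading of "counting matches between $k$-mers"), and the minimizer of a $k$-mer is assumed to be its lexicographically smallest sub-sequence of length $m < k$ in the forward direction only. The approximate kernel and the information gain step are not shown.

```python
from collections import Counter


def kmers(seq: str, k: int) -> list[str]:
    """Return every contiguous sub-sequence of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def exact_kmer_similarity(a: str, b: str, k: int) -> int:
    """Count matching k-mer pairs between a and b.

    Assumption: similarity is the dot product of the two k-mer count
    vectors; the paper's exact kernel may be defined differently.
    """
    counts_a = Counter(kmers(a, k))
    counts_b = Counter(kmers(b, k))
    shared = counts_a.keys() & counts_b.keys()
    return sum(counts_a[x] * counts_b[x] for x in shared)


def minimizer(kmer: str, m: int) -> str:
    """Return the lexicographically smallest m-mer inside one k-mer.

    Assumption: forward direction only; some definitions also consider
    the reversed k-mer.
    """
    return min(kmers(kmer, m))


def minimizer_sequence(seq: str, k: int, m: int) -> list[str]:
    """Replace each k-mer of seq by its minimizer (the preprocessing step)."""
    return [minimizer(km, m) for km in kmers(seq, k)]


if __name__ == "__main__":
    # Toy amino-acid fragment, used purely for illustration.
    fragment = "MFVFLV"
    print(kmers(fragment, 5))
    print(minimizer_sequence(fragment, k=5, m=3))
    print(exact_kmer_similarity("ABAB", "BABA", k=2))
```

Because each $k$-mer is reduced to a shorter $m$-mer, the minimizer representation has a smaller alphabet of possible tokens than the raw $k$-mer representation, which is the sense in which it acts as a compact preprocessing step before a kernel is computed.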