A Novel Scalable Apache Spark Based Feature Extraction Approaches for Huge Protein Sequence and their Clustering Performance Analysis

Paper Title

A Novel Scalable Apache Spark Based Feature Extraction Approaches for Huge Protein Sequence and their Clustering Performance Analysis

Authors

Jha, Preeti; Tiwari, Aruna; Bharill, Neha; Ratnaparkhe, Milind; Patel, Om Prakash; Harshith, Nilagiri; Mounika, Mukkamalla; Nagendra, Neha

Abstract

Genome sequencing projects are rapidly increasing the number of high-dimensional protein sequence datasets. Clustering such datasets with traditional machine learning approaches poses many challenges. Many feature extraction methods exist and are widely used, but extracting features from millions of protein sequences becomes impractical because the current algorithms do not scale. An efficient feature extraction approach that extracts significant features is therefore needed. We propose two scalable approaches for extracting features from huge protein sequences using Apache Spark, termed 60d-SPF (60-dimensional Scalable Protein Feature) and 6d-SCPSF (6-dimensional Scalable Co-occurrence-based Probability-Specific Feature). Both approaches capture the statistical properties of amino acids to create a fixed-length numeric feature vector that represents each protein sequence with 60 and 6 dimensions, respectively. The preprocessed protein sequences are then used as input to two clustering algorithms: Scalable Random Sampling with Iterative Optimization Fuzzy c-Means (SRSIO-FCM) and Scalable Literal Fuzzy c-Means (SLFCM). We conducted extensive experiments on various soybean protein datasets to evaluate the effectiveness of 60d-SPF, 6d-SCPSF, and existing feature extraction methods with the SRSIO-FCM and SLFCM clustering algorithms. Results reported in terms of the Silhouette index and the Davies-Bouldin index show that 60d-SPF, with both SRSIO-FCM and SLFCM, achieves significantly better results than 6d-SCPSF and the existing feature extraction approaches.
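
To illustrate what a fixed-length numeric feature vector built from amino-acid statistics can look like, the following is a minimal sketch in plain Python. It is not the 60d-SPF or 6d-SCPSF method, whose definitions the abstract does not give. It computes the simple 20-dimensional amino-acid composition (relative frequency of each standard residue) as a stand-in, and the names `AMINO_ACIDS` and `composition_vector` are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so that every
# sequence maps to a vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence: str) -> list[float]:
    """Map a protein sequence of any length to 20 relative frequencies.

    Illustrative stand-in only: this is plain amino-acid composition,
    not the 60d-SPF or 6d-SCPSF features described in the abstract.
    """
    counts = Counter(sequence.upper())
    # Count only standard residues; symbols such as 'X' are ignored.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Sequences of different lengths yield vectors of the same length.
    for seq in ["MKVLAAGIV", "MK"]:
        vec = composition_vector(seq)
        print(seq, len(vec), round(sum(vec), 6))
```

Because each sequence is processed independently of all others, a per-sequence function of this kind can be applied in parallel across a distributed collection, which is the general property a Spark-based extraction step relies on.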

Paper Title