Analysis of Compression Techniques for DNA Sequence Data

Paper Title

Analysis of Compression Techniques for DNA Sequence Data

Authors

Bibi, Shakeela; Iqbal, Javed; Iftekhar, Adnan; Hassan, Mir

Abstract

Biological data consists mainly of deoxyribonucleic acid (DNA) and protein sequences, biomolecules that are present in all human cells. Because DNA is self-replicating, it is a key constituent of the genetic material found in all living creatures, and it carries the genetic information required for the functioning and development of all living organisms. Storing the DNA data of a single person requires 10 CD-ROMs. Moreover, this volume is growing constantly as more and more sequences are added to public databases. This rapid growth makes it challenging to extract precise information from the data, since many data analysis and visualization tools cannot process such large volumes. To reduce the size of DNA and protein sequences, researchers have introduced various sequence compression algorithms, such as compress or gzip, Context Tree Weighting (CTW), Lempel-Ziv-Welch (LZW), arithmetic coding, run-length encoding, and substitution methods. These techniques have contributed substantially to reducing the volume of biological datasets. On the other hand, traditional compression techniques are not well suited to compressing this type of sequence data. In this paper, we explore diverse techniques for compressing large amounts of DNA sequence data. The analysis reveals that efficient techniques not only reduce the size of a sequence but also avoid any information loss. The review of existing studies also shows that compressing a DNA sequence is important for understanding the critical characteristics of DNA data, in addition to improving storage efficiency and data transmission. The compression of protein sequences, moreover, remains a challenge for the research community. The major parameters for evaluating these compression algorithms include compression ratio and running time complexity, among others.
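
As an illustration of two ideas named in the abstract (a substitution method and the compression ratio), the following is a minimal Python sketch and not the method evaluated in the paper. It assumes the input contains only the four bases A, C, G and T, stored originally as one byte per base, and it assumes the convention that compression ratio means original size divided by compressed size. The function names `pack`, `unpack` and `compression_ratio` are hypothetical.

```python
# Hypothetical illustration: 2-bit substitution coding for DNA bases.
# Assumes the sequence contains only A, C, G and T (no N or other symbols).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack an A/C/G/T string into 2 bits per base (4 bases per byte)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        # Left-align a short final chunk so bases are always read from the high bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the original string; `length` discards padding in the last byte."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Assumed convention: original size divided by compressed size."""
    return original_size / compressed_size


seq = "ACGTACGTTTGA"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq  # lossless round trip
ratio = compression_ratio(len(seq), len(packed))
```

The round-trip assertion reflects the abstract's point that a useful technique must avoid information loss. Real sequence files also contain ambiguity codes, headers and line breaks, which this sketch deliberately ignores.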
