Genome-on-Diet: Taming Large-Scale Genomic Analyses via Sparsified Genomics

Paper Title

Genome-on-Diet: Taming Large-Scale Genomic Analyses via Sparsified Genomics

Authors

Mohammed Alser, Julien Eudine, Onur Mutlu

Abstract

Searching for similar genomic sequences is an essential and fundamental step in biomedical research and an overwhelming majority of genomic analyses. State-of-the-art computational methods performing such comparisons fail to cope with the exponential growth of genomic sequencing data. We introduce the concept of sparsified genomics where we systematically exclude a large number of bases from genomic sequences and enable much faster and more memory-efficient processing of the sparsified, shorter genomic sequences, while providing similar or even higher accuracy compared to processing non-sparsified sequences. Sparsified genomics provides significant benefits to many genomic analyses and has broad applicability. We show that sparsifying genomic sequences greatly accelerates the state-of-the-art read mapper (minimap2) by 2.57-5.38x, 1.13-2.78x, and 3.52-6.28x using real Illumina, HiFi, and ONT reads, respectively, while providing up to 2.1x smaller memory footprint, 2x smaller index size, and more truly detected small and structural variations compared to minimap2. Sparsifying genomic sequences makes containment search through very large genomes and large databases 72.7-75.88x faster and 723.3x more storage-efficient than searching through non-sparsified genomic sequences (with CMash and KMC3). Sparsifying genomic sequences enables robust microbiome discovery by providing 54.15-61.88x faster and 720x more storage-efficient taxonomic profiling of metagenomic samples over the state-of-the-art tool (Metalign). We design and open-source a framework called Genome-on-Diet as an example tool for sparsified genomics, which can be freely downloaded from https://github.com/CMU-SAFARI/Genome-on-Diet.
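
As a rough illustration of what "systematically excluding bases" could look like, the sketch below drops bases according to a short repeating keep/drop pattern. The function name `sparsify` and the binary pattern strings are assumptions made for this example; they are not taken from the abstract and should not be read as the Genome-on-Diet interface or its actual algorithm.

```python
from itertools import cycle


def sparsify(sequence: str, pattern: str = "10") -> str:
    """Keep the bases that line up with a '1' in the repeating pattern.

    Hypothetical helper for illustration only: '1' keeps a base, '0' drops it.
    """
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of '0' and '1'")
    if "1" not in pattern:
        raise ValueError("pattern must keep at least one base")
    # cycle() repeats the pattern so it covers a sequence of any length.
    return "".join(
        base for base, keep in zip(sequence, cycle(pattern)) if keep == "1"
    )


read = "ACGTACGTAC"
print(sparsify(read, "10"))   # keeps every other base
print(sparsify(read, "110"))  # keeps two of every three bases
```

Applying the same pattern to both the reads and the reference would shorten both in a comparable way, which is one plausible reading of how shorter sequences lead to smaller indexes and faster comparison; the abstract itself does not spell out the mechanism.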
