Indexes of large genome collections on a PC.

Agnieszka Danek,Sebastian Deorowicz,Szymon Grabowski

doi:10.1371/journal.pone.0109384


Open Access


Abstract

The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with existing algorithms due to their large memory requirements. We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in the exact and approximate matching models, against a collection of thousand(s) of genomes. Its unique feature is the small index size, which is customisable. It fits in a standard computer with 16–32 GB, or even 8 GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, exact matching queries (of average length 150 bp) are handled in an average time of 39 µs, and queries with up to 3 mismatches in 373 µs, on the test PC with an index size of 13.4 GB. For a smaller index, occupying 7.4 GB in memory, the respective times grow to 76 µs and 917 µs. Software is available at http://sun.aei.polsl.pl/mugi under a free license. Data S1 is available at PLOS One online.

Highlights

About a decade ago, thanks to breakthrough ideas in succinct indexing data structures, it became clear that a full mammalian-sized genome can be stored and used in indexed form in the main memory of a commodity workstation.
The earliest such attempt, by Sadakane and Shibuya [1], resulted in a compressed suffix array of approximately 2 GB built for the April 2001 draft assembly by the Human Genome Project at UCSC. (Obtaining low construction space was more challenging; more memory-frugal, or disk-based, algorithms for building compressed indexes appeared later; see, e.g., [2] and references therein.)
Datasets: We index large collections of genomes of the same species, represented as a reference genome in FASTA format together with a VCF [31] file describing all possible variations of the reference sequence and the genotype information for each genome in the dataset.
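
The reference-plus-variants representation described in the last highlight can be illustrated with a minimal sketch. It is not MuGI's index construction and does not parse real FASTA or VCF files. It only shows, under simplifying assumptions, how one haplotype sequence is implied by a reference string, a list of variants, and a per-variant genotype flag. The names `Variant` and `apply_variants` are hypothetical, positions are assumed 0-based (VCF itself uses 1-based positions), and a variant that overlaps one already applied is simply skipped.

```python
from typing import List, Tuple

# Hypothetical in-memory variant record: (0-based position, REF allele, ALT allele).
Variant = Tuple[int, str, str]


def apply_variants(reference: str, variants: List[Variant], genotype: List[int]) -> str:
    """Return the haplotype obtained by applying to `reference` every variant
    whose flag in `genotype` is 1 (one flag per variant, in the same order)."""
    if len(variants) != len(genotype):
        raise ValueError("genotype must hold one flag per variant")

    pieces = []
    cursor = 0  # first reference position not yet copied to the output
    for (pos, ref, alt), present in sorted(zip(variants, genotype)):
        if not present:
            continue
        if pos < cursor:
            # Simplification: ignore a variant overlapping an applied one.
            continue
        if reference[pos:pos + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        pieces.append(reference[cursor:pos])  # unchanged reference stretch
        pieces.append(alt)                    # substituted, inserted, or shortened allele
        cursor = pos + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    reference = "ACGTACGT"
    variants = [(1, "C", "T"), (3, "TA", "T"), (6, "G", "GGG")]  # SNP, deletion, insertion
    print(apply_variants(reference, variants, [1, 0, 1]))  # ATGTACGGGT
    print(apply_variants(reference, variants, [0, 1, 0]))  # ACGTCGT
```

For a diploid individual, such a function would be called twice, once per haplotype, with the two genotype columns taken from the variant file. The point of the representation is that the collection is never stored as full sequences: only the reference, the variant list, and the genotype flags are kept.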

Summary

Introduction

Thanks to breakthrough ideas in succinct indexing data structures, it became clear that a full mammalian-sized genome can be stored and used in indexed form in the main memory of a commodity workstation (equipped with, e.g., 4 GB of RAM). The earliest such attempt, by Sadakane and Shibuya [1], resulted in a compressed suffix array of approximately 2 GB built for the April 2001 draft assembly by the Human Genome Project at UCSC. Nowadays, when repositories with a thousand or more genomes are available, life scientists' goals are more ambitious, and it is desirable to search for patterns in large genomic collections. One application of such a solution could be simultaneous alignment of sequencing reads against multiple genomes [3].



Journal: PLoS ONE	Publication Date: Oct 7, 2014
Citations: 43	License type: CC BY 4.0
