Abstract
The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with the use of existing algorithms due to their large memory requirements. We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in exact and approximate matching model, against a collection of thousand(s) genomes. Its unique feature is the small index size, which is customisable. It fits in a standard computer with 16–32 GB, or even 8 GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, the exact matching queries (of average length 150 bp) are handled in average time of 39 µs and with up to 3 mismatches in 373 µs on the test PC with the index size of 13.4 GB. For a smaller index, occupying 7.4 GB in memory, the respective times grow to 76 µs and 917 µs. Software is available at http://sun.aei.polsl.pl/mugi under a free license. Data S1 is available at PLOS One online.
Highlights
About a decade ago, thanks to breakthrough ideas in succinct indexing data structures, it was made clear that a full mammaliansized genome can be stored and used in indexed form in main memory of a commodity workstation
The earliest such attempt, by Sadakane and Shibuya [1], resulted in approximately 2 GB sized compressed suffix array built for the April 2001 draft assembly by Human Genome Project at UCSC. (Obtaining low construction space, was more challenging, later more memory frugal, or disk-based, algorithms for building compressed indexes appeared, see, e.g., [2] and references therein.)
Datasets We are indexing large collection of genomes of the same species, which are represented as the reference genome in FASTA format together with the VCF [31] file, describing all possible reference sequence variations and the genotype information for each of the genome in the dataset
Summary
Thanks to breakthrough ideas in succinct indexing data structures, it was made clear that a full mammaliansized genome can be stored and used in indexed form in main memory of a commodity workstation (equipped with, e.g., 4 GB of RAM) The earliest such attempt, by Sadakane and Shibuya [1], resulted in approximately 2 GB sized compressed suffix array built for the April 2001 draft assembly by Human Genome Project at UCSC. Nowadays, when repositories with a thousand or more genomes are available, the life scientists’ goals are more ambitious, and it is desirable to search for patterns in large genomic collections One application of such a solution could be simultaneous alignment of sequencing reads against multiple genomes [3].
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.