Abstract
We propose a lightweight data structure for indexing and querying collections of NGS read data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), is based on finding overlapping reads and is competitive with existing algorithms in space use, query times, or both. The main applications of our index include variant calling, error correction, and analysis of reads from RNA-seq experiments.
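To make the counting and locating queries concrete, here is a minimal Python sketch of a k-mer index over a read collection. It is not the PgSA implementation: PgSA builds its pseudogenome by merging overlapping reads, whereas this sketch simply concatenates the reads with a `$` separator and sorts the suffixes naively. The function names `build_index`, `locate_kmer`, and `count_kmer` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left, bisect_right


def build_index(reads):
    """Concatenate reads with '$' separators and build a naive suffix array.

    Assumption: reads contain only A, C, G, T, so a k-mer never matches
    across a '$' boundary. Plain concatenation stands in for the
    overlap-based pseudogenome described in the abstract.
    """
    text = "$".join(reads) + "$"
    starts = []  # offset of each read's first symbol within text
    pos = 0
    for read in reads:
        starts.append(pos)
        pos += len(read) + 1  # +1 for the separator
    # Naive construction: sort suffix start positions by the suffix itself.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    return text, starts, suffix_array


def locate_kmer(index, kmer):
    """Return sorted (read_id, offset) pairs for every occurrence of kmer."""
    text, starts, suffix_array = index
    k = len(kmer)
    # Length-k prefixes of sorted suffixes are themselves in sorted order,
    # so the matching suffixes form one contiguous range.
    prefixes = [text[i:i + k] for i in suffix_array]  # for clarity, not speed
    lo = bisect_left(prefixes, kmer)
    hi = bisect_right(prefixes, kmer)
    hits = []
    for i in suffix_array[lo:hi]:
        read_id = bisect_right(starts, i) - 1  # read that contains position i
        hits.append((read_id, i - starts[read_id]))
    return sorted(hits)


def count_kmer(index, kmer):
    """Return (number of occurrences, number of distinct reads) for kmer."""
    hits = locate_kmer(index, kmer)
    return len(hits), len({read_id for read_id, _ in hits})


if __name__ == "__main__":
    idx = build_index(["ACGTAC", "GTACGT", "TTACGA"])
    print(locate_kmer(idx, "GT"))
    print(count_kmer(idx, "GT"))
```

A real index would avoid materializing the prefix list and would use a linear-time or compressed suffix array construction; the sketch trades efficiency for readability.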
Highlights
Genome sequencing costs recently dropped to less than 5 thousand U.S. dollars per human genome at about 30-fold coverage [1]
Gk arrays (GkA) are faster than CGkA, yet require at least 3 times more space
In the Q4 query, given by position, GkA is a clear winner in speed
Summary
Genome sequencing costs recently dropped to less than 5 thousand U.S. dollars per human genome at about 30-fold coverage [1]. The result is an enormous amount of sequencing data that must be processed. Reads are mapped onto reference genomes, and variant calling algorithms are used to identify the mutations present in the sequenced genomes. Since mapping requires fast search over reference genomes, many indexing structures for genomes were adopted or invented. The situation changed with the advent of much more compact (compressed) index data structures. One recent successful example is the MuGI multi-genome index [9], which can index 1092 human genomes in less than 10 GB of memory.