Abstract

We propose a lightweight data structure for indexing and querying collections of NGS read data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), is based on finding overlapping reads and is competitive with the existing algorithms in space use, query times, or both. The main applications of our index include variant calling, error correction, and analysis of reads from RNA-seq experiments.
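To make the counting and locating queries concrete, the following is a minimal sketch of such an interface over a small read collection. It is a naive hash-table index, not the PgSA structure, and it makes no attempt at space efficiency. The function names (`build_kmer_index`, `count_occurrences`, `count_reads`, `locate`) and the toy reads are assumptions made for illustration only.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Map each k-mer to the (read_id, offset) pairs where it occurs.

    Naive illustrative index: memory grows with the total number of
    k-mer occurrences, unlike a compact pseudogenome-based index.
    """
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        for offset in range(len(read) - k + 1):
            index[read[offset:offset + k]].append((read_id, offset))
    return index


def count_occurrences(index, kmer):
    """Total number of occurrences of kmer across all reads."""
    return len(index.get(kmer, []))


def count_reads(index, kmer):
    """Number of distinct reads that contain kmer at least once."""
    return len({read_id for read_id, _ in index.get(kmer, [])})


def locate(index, kmer):
    """All (read_id, offset) positions of kmer, in read order."""
    return list(index.get(kmer, []))


# Hypothetical toy data: the first read contains "ACG" twice.
reads = ["ACGACG", "GTACGT", "TTTTTT"]
index = build_kmer_index(reads, k=3)

print(count_occurrences(index, "ACG"))  # 3
print(count_reads(index, "ACG"))        # 2
print(locate(index, "ACG"))             # [(0, 0), (0, 3), (1, 2)]
```

The example separates counting occurrences from counting reads because a k-mer may appear more than once in a single read, so the two answers can differ.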

Highlights

  • Genome sequencing costs recently dropped to less than 5 thousand U.S. dollars per human genome with about 30-fold coverage [1]

  • Gk arrays (GkA) are faster than CGkA but require at least 3 times more space

  • In the Q4 query, given by position, GkA is a clear winner in speed


Summary

Introduction

Genome sequencing costs recently dropped to less than 5 thousand U.S. dollars per human genome with about 30-fold coverage [1]. This results in enormous amounts of sequencing data, which have to be processed in some way. The reads are mapped onto reference genomes, and variant calling algorithms are used to identify the mutations present in the sequenced genomes. Since the mapping requires fast search over reference genomes, many indexing structures for genomes have been adopted or invented. The situation changed with the advent of much more compact (compressed) index data structures. One recent successful example is the MuGI multi-genome index [9], which can index 1092 human genomes in less than 10 GB of memory.

