Abstract

K-mer-based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large, multi-terabyte datasets. Storing such large datasets is a burden on tool developers, tool users, and reproducibility efforts. General-purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm, UST-Compress, that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.

Highlights

  • Many of today’s bioinformatics analyses are powered by tools that are k-mer-based

  • We present two algorithms for the disk compression of k-mer sets, ESS-Compress and ESS-Tip-Compress

  • The two algorithms present a tradeoff between time/memory and compression size, which we explore in our results


Introduction

Many of today’s bioinformatics analyses are powered by tools that are k-mer-based. These tools first reduce the input sequence data, which may be of various lengths and types, to a set of short fixed-length strings called k-mers.

For every edge we add to our path cover, we glue the two unitigs it joins and remove one duplicate instance of the (k − 1)-mer from the corresponding SPSS.

Main algorithm

Our starting point is a set of canonical k-mers K, the graph cdBG(K), and a vertex-disjoint normalized path cover of cdBG(K) returned by UST. To develop the intuition for our algorithm, we first consider a simple example (Fig. 1A).
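The gluing step described above can be illustrated with a minimal Python sketch. It is not the UST or ESS-Compress implementation. The helper names (`reverse_complement`, `canonical_kmers`, `glue`) are hypothetical. For simplicity, the sketch assumes both strings are already oriented so that the suffix of the first equals the prefix of the second, and it does not handle the reverse-complement orientations that a compacted de Bruijn graph allows.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def canonical_kmers(seq, k):
    """Return the set of canonical k-mers of seq.

    The canonical form is taken here to be the lexicographically smaller
    of a k-mer and its reverse complement (a common convention, assumed
    for this sketch).
    """
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def glue(left, right, k):
    """Merge two strings that overlap by k - 1 characters.

    Only one copy of the shared (k - 1)-mer is kept, so the result is
    k - 1 characters shorter than the two inputs stored separately.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    overlap = k - 1
    if left[-overlap:] != right[:overlap]:
        raise ValueError("strings do not overlap by k - 1 characters")
    return left + right[overlap:]


# Toy example with k = 4: the two strings share the 3-mer "CGT".
k = 4
left, right = "ACGT", "CGTA"
glued = glue(left, right, k)

# Characters saved by storing the glued string instead of both inputs.
saved = len(left) + len(right) - len(glued)

# In this example, gluing neither loses nor introduces canonical k-mers.
same_kmers = canonical_kmers(glued, k) == (
    canonical_kmers(left, k) | canonical_kmers(right, k)
)
```

In this toy case the glued string stores the same k-mer set in k − 1 fewer characters, which is the redundancy each added path-cover edge removes from the string set.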

