A bucket index correction based method for compression of genomic sequencing data

Rongjie Wang, Tianyi Zang, Yadong Wang, Qianlong Cheng, Yang Bai

doi:10.1109/bibm.2017.8217727

Abstract

As high-throughput sequencing technologies generate vast amounts of data, there is an urgent need for efficient sequencing data compression algorithms. Existing methods usually dispatch similar sequences into the same bucket according to their shared minimizer, that is, the lexicographically smallest k-mer within the sequence. However, when a sequencing error falls within the minimizer region, the sequence can be dispatched into the wrong bucket, which degrades the subsequent compression. In this paper, we propose BIC, a novel bucket index correction method for sequencing data compression. BIC is the first method to correct sequencing errors in the minimizer region, so that more similar sequences are dispatched into the same bucket and the sequencing data can be compressed more effectively. Compared with three state-of-the-art methods on five different data sets, BIC achieves a higher compression rate. The code for BIC is available at https://github.com/rongjiewang/BIC.
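
To make the bucketing problem concrete, the following minimal Python sketch shows minimizer-based bucketing and how a single substitution inside the minimizer moves a read into a different bucket. It is an illustration only, not the BIC implementation: the function names `minimizer` and `bucket_reads`, the toy reads, and the choice k = 4 are assumptions made for this example, and reverse complements are ignored.

```python
from collections import defaultdict


def minimizer(read: str, k: int) -> str:
    """Return the lexicographically smallest k-mer of `read`."""
    if len(read) < k:
        raise ValueError("read is shorter than k")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_reads(reads, k):
    """Group reads by their minimizer (hypothetical helper for illustration)."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    return dict(buckets)


# Toy reads (assumed data): the first two overlap and share a minimizer.
# The third is the first read with one substitution (A -> T) inside the
# minimizer, which sends it to a different bucket.
reads = [
    "GGTACGAACTTG",
    "GTACGAACTTGC",
    "GGTACGATCTTG",
]

buckets = bucket_reads(reads, k=4)
for key, members in sorted(buckets.items()):
    print(key, members)
```

Here the third read differs from the first by one base, yet it no longer shares a bucket with it. This is the kind of misplacement that a bucket index correction step aims to undo.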

Full Text