Referential genome sequence compression with low memory consumption

Lu Zhiwen, Chen Jianhua, Wang Rongshu

doi:10.1117/12.2631583

Abstract

With the rapid development of genome sequencing technology, a large amount of genome data has been generated, which raises the problem of storing this massive data. The compression of genome data has therefore become a research hotspot. We propose a new genome data compression algorithm for FASTA-format sequences called LCMRGC (low memory consumption referential genome compressor). The algorithm uses a suffix array to support the search for matching strings and uses binary search to accelerate exact matching, so as to obtain a better compression ratio. Experimental results on standard genome data show that the proposed algorithm significantly reduces the memory required to run the program while remaining competitive in compression ratio and compression time.
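To make the general idea of suffix-array matching with binary search concrete, the following is a minimal sketch of referential compression in Python. It is not the LCMRGC implementation, whose details the abstract does not give. All function names, the `min_len` threshold, and the `("match", pos, length)` / `("literal", base)` output format are assumptions chosen for illustration, and the naive suffix sort is suitable only for short sequences.

```python
def build_suffix_array(ref):
    # Naive construction: sort suffix start positions by the suffix text.
    # Fine for a short illustration; real tools use linear-time builders.
    return sorted(range(len(ref)), key=lambda i: ref[i:])


def common_prefix_len(ref, start, query):
    # Number of leading characters shared by ref[start:] and query.
    n = 0
    while start + n < len(ref) and n < len(query) and ref[start + n] == query[n]:
        n += 1
    return n


def longest_match(ref, sa, query):
    # Binary search for the first suffix that is >= query in sorted order.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        start = sa[mid]
        if ref[start:start + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    # The longest common prefix is held by a neighbour of the insertion point.
    best_pos, best_len = 0, 0
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            n = common_prefix_len(ref, sa[idx], query)
            if n > best_len:
                best_pos, best_len = sa[idx], n
    return best_pos, best_len


def encode(ref, target, min_len=4):
    # Greedy pass: emit a reference match when it is long enough,
    # otherwise emit the current base as a literal. min_len is hypothetical.
    sa = build_suffix_array(ref)
    ops, i = [], 0
    while i < len(target):
        pos, length = longest_match(ref, sa, target[i:])
        if length >= min_len:
            ops.append(("match", pos, length))
            i += length
        else:
            ops.append(("literal", target[i]))
            i += 1
    return ops


def decode(ref, ops):
    # Rebuild the target from reference slices and literal bases.
    out = []
    for op in ops:
        if op[0] == "match":
            out.append(ref[op[1]:op[1] + op[2]])
        else:
            out.append(op[1])
    return "".join(out)
```

For example, calling `encode(ref, target)` on two short DNA strings yields a list of operations that `decode(ref, ops)` turns back into `target`. A practical compressor would additionally entropy-code the operations and handle FASTA headers, line lengths, and non-ACGT symbols, none of which this sketch attempts.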
