HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data.

Haichang Yao, Yimu Ji, Shangdong Liu, Kui Li, Ruchuan Wang, Jing He

doi:10.1155/2019/3108950

Abstract

With the maturity of genome sequencing technology, huge amounts of sequence reads and assembled genomes are being generated. This explosive growth poses enormous challenges for the storage and transmission of genomic data. FASTA, one of the main storage formats for genome sequences, is widely used in the Gene Bank because it eases sequence analysis and gene research and is easy to read. Many compression methods for FASTA genome sequences have been proposed, but they still have room for improvement: their compression ratio and speed are neither high enough nor robust enough, and their memory consumption is not ideal. Improving the efficiency, robustness, and practicability of genomic data compression is therefore of great significance, both to further reduce the storage and transmission cost of genomic data and to promote the research and development of genomic technology. In this manuscript, a hybrid referential compression method (HRCM) for FASTA genome sequences is proposed. HRCM is a lossless compression method that can compress a single sequence as well as large collections of sequences. It is implemented in three stages: sequence information extraction, sequence information matching, and sequence information encoding. The performance of HRCM was evaluated in a large number of experiments. The experiments show that HRCM is superior to the best-known methods in genome batch compression. Moreover, the memory consumption of HRCM is relatively low, so it can be deployed on standard PCs.
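To make the idea of referential compression concrete, the following is a minimal toy sketch of the matching step in general, not the HRCM implementation. It assumes a simple k-mer hash index over the reference and greedy match extension; the function names (`build_index`, `encode`, `decode`), the token layout, and the choice of `k = 4` are all illustrative assumptions that do not come from the paper.

```python
from typing import Dict, List, Tuple, Union

# A token is either a (reference_position, length) match or a literal string.
Token = Union[Tuple[int, int], str]


def build_index(ref: str, k: int) -> Dict[str, int]:
    """Map every k-mer of the reference to its first position (toy index)."""
    index: Dict[str, int] = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], i)
    return index


def encode(ref: str, target: str, k: int = 4) -> List[Token]:
    """Greedily describe the target as matches against the reference plus literals."""
    index = build_index(ref, k)
    tokens: List[Token] = []
    literal: List[str] = []
    i = 0
    while i < len(target):
        pos = index.get(target[i:i + k]) if i + k <= len(target) else None
        if pos is None:
            # No k-mer match starts here: emit one base as a literal.
            literal.append(target[i])
            i += 1
            continue
        # Extend the seed match as far as both sequences agree.
        length = k
        while (i + length < len(target) and pos + length < len(ref)
               and target[i + length] == ref[pos + length]):
            length += 1
        if literal:
            tokens.append("".join(literal))
            literal = []
        tokens.append((pos, length))
        i += length
    if literal:
        tokens.append("".join(literal))
    return tokens


def decode(ref: str, tokens: List[Token]) -> str:
    """Rebuild the target from the reference and the token list (lossless)."""
    parts: List[str] = []
    for token in tokens:
        if isinstance(token, str):
            parts.append(token)
        else:
            pos, length = token
            parts.append(ref[pos:pos + length])
    return "".join(parts)


reference = "ACGTACGTTTGACCA"
target = "ACGTACGATTGACCA"  # differs from the reference by one base
tokens = encode(reference, target)
restored = decode(reference, tokens)
```

In this toy case the target is described by two matches and a single literal base, and decoding with the same reference restores it exactly. A real compressor would additionally entropy-code the tokens and handle FASTA details such as headers, line width, and case.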

Highlights

Since the launch of the International Human Genome Project in 1990, the emergence of high-throughput sequencing technologies, such as single-molecule sequencing technology [1] and next-generation sequencing (NGS) technology [2], has led to the reduction in the cost of genome sequencing and the improvement in the speed of sequencing [3].
Similar to the testing scheme used in HiRGC and SCCG, our experiments used the same eight genomes of H. sapiens, and each genome was used in turn as the reference genome to compress the other genomes. This excludes the contingency caused by the selection of the reference and allows the robustness and practicability of the method to be fully evaluated. iDoComp and NRGC failed to compress some genome sequences; for those cases, we compressed the original file with the PPMD compression algorithm, as described in the original papers.
The compressed file sizes of the hybrid referential compression method (HRCM) and the six compared methods, together with the corresponding improvement of HRCM-B over the other methods, are summarized in Table 4. The original file size and compressed file size in the table are the sum of the original file sizes and the sum of the compressed file sizes of the 7 to-be-compressed genomes.
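The rotation scheme described above (each of the eight genomes serves once as the reference while the other seven are compressed, and sizes are reported as sums) can be sketched as follows. This is only an illustration of the bookkeeping: the genome labels `G1` to `G8`, the helper names, and the byte counts in the usage lines are hypothetical placeholders, not data or code from the paper.

```python
from typing import Dict, List, Tuple


def rotation_plan(genomes: List[str]) -> List[Tuple[str, List[str]]]:
    """Pair every genome, used as reference, with all remaining genomes as targets."""
    return [(ref, [g for g in genomes if g != ref]) for ref in genomes]


def total_size(sizes: Dict[str, int], targets: List[str]) -> int:
    """Sum the per-genome sizes (in bytes) over the to-be-compressed genomes."""
    return sum(sizes[t] for t in targets)


# Hypothetical labels standing in for the eight H. sapiens genomes.
genomes = [f"G{i}" for i in range(1, 9)]
plan = rotation_plan(genomes)

# Hypothetical placeholder sizes, only to show how a per-reference sum is formed.
placeholder_sizes = {g: 100 for g in genomes}
first_ref, first_targets = plan[0]
first_total = total_size(placeholder_sizes, first_targets)
```

Each entry of `plan` corresponds to one experiment with a fixed reference, and the reported size for that experiment is the sum over its seven targets.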

Summary

Introduction

Since the launch of the International Human Genome Project in 1990, the emergence of high-throughput sequencing technologies, such as single-molecule sequencing technology [1] and next-generation sequencing (NGS) technology [2], has led to the reduction in the cost of genome sequencing and the improvement in the speed of sequencing [3]. Many countries and organizations have launched genomic engineering projects [4,5,6]. As a variety of sequencing projects unfold, the amount of genomic data generated is exploding, and the growth rate will be even faster in the future. By 2025, genomic data alone is expected to grow at a rate of 1 zettabase/year (1 Z = 10^21) [7]. Genomic data is growing faster than storage capacity and transmission bandwidth, which puts heavy pressure on storage and data transmission [8,9,10]. Storing genomic data efficiently and reducing the pressure of storage and data migration is therefore of great significance in genomic research and application [11].

Methods

Results

Conclusion