Handling the data management needs of high-throughput sequencing data: SpeedGene, a compression algorithm for the efficient storage of genetic data

Dandi Qiao, Wai-Ki Yip, Christoph Lange

doi:10.1186/1471-2105-13-100

Open Access

https://doi.org/10.1186/1471-2105-13-100

Abstract

Background: As Next-Generation Sequencing data becomes available, existing hardware environments do not provide sufficient storage space and computational power to store and process the data, owing to its enormous size. This is, and will remain, a problem that researchers working on genetic data encounter every day. Some options exist for compressing and storing such data, including general-purpose compression software and the PBAT/PLINK binary format. However, these methods either do not offer sufficient compression rates or require a great amount of CPU time for decompression and loading every time the data is accessed.

Results: Here, we propose a novel and simple algorithm for storing such sequencing data. We show that the compression factor of the algorithm ranges from 16 to several hundred, which potentially allows SNP data of hundreds of gigabytes to be stored in hundreds of megabytes. We provide a C++ implementation of the algorithm, which supports direct loading and parallel loading of the compressed format without requiring extra time for decompression. By applying the algorithm to simulated and real datasets, we show that it gives a greater compression rate than the commonly used compression methods and that the data-loading process takes less time. The C++ library also provides direct data-retrieval functions, which allow the compressed information to be easily accessed by other C++ programs.

Conclusions: The SpeedGene algorithm enables the storage and analysis of next-generation sequencing data in current hardware environments, making system upgrades unnecessary.

Highlights

As Next-Generation Sequencing data becomes available, existing hardware environments do not provide sufficient storage space and computational power to store and process the data due to their enormous size
We show that our algorithm always works better than the compression algorithm implemented in PLINK or PBAT and provides excellent compression rate for sequencing data
To assess the performance of the SpeedGene algorithm, we compare it with the standard LINKAGE/PLINK format and the PLINK/PBAT compression algorithm

Summary

Introduction

As Next-Generation Sequencing data becomes available, existing hardware environments do not provide sufficient storage space and computational power to store and process the data, owing to its enormous size. Some options exist for compressing and storing such data, including general-purpose compression software and the PBAT/PLINK binary format. These methods either do not offer sufficient compression rates or require a great amount of CPU time for decompression and loading every time the data is accessed. PLINK and PBAT, which are free whole-genome association analysis toolsets, have introduced Binary PED formats [4,5]. This format ensures that only 2 bits are required to store the information of one genotype.
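The 2-bits-per-genotype idea can be sketched in a few lines of C++. This is only an illustration of the packing principle, not the actual Binary PED layout or the SpeedGene algorithm: the code values in `Genotype`, the bit order within each byte, and the function names `pack_genotypes` and `get_genotype` are all assumptions made for this example. Under the further assumption that a text genotype occupies 4 bytes (two allele characters plus two separators), storing it in 2 bits corresponds to a 16-fold reduction.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 2-bit codes for one biallelic genotype.
// The real PLINK/PBAT bit assignments may differ.
enum Genotype : std::uint8_t { HomRef = 0, Het = 1, HomAlt = 2, Missing = 3 };

// Pack genotypes four per byte, filling the lowest-order bit pair first.
std::vector<std::uint8_t> pack_genotypes(const std::vector<std::uint8_t>& genotypes) {
    std::vector<std::uint8_t> packed((genotypes.size() + 3) / 4, 0);
    for (std::size_t i = 0; i < genotypes.size(); ++i) {
        const unsigned shift = 2 * static_cast<unsigned>(i % 4);
        packed[i / 4] = static_cast<std::uint8_t>(
            packed[i / 4] | ((genotypes[i] & 0x3u) << shift));
    }
    return packed;
}

// Read genotype i straight from the packed buffer,
// without unpacking any of the other genotypes.
std::uint8_t get_genotype(const std::vector<std::uint8_t>& packed, std::size_t i) {
    const unsigned shift = 2 * static_cast<unsigned>(i % 4);
    return static_cast<std::uint8_t>((packed[i / 4] >> shift) & 0x3u);
}
```

In this sketch, five genotypes fit in two bytes, and `get_genotype` reads any single entry by index from the packed buffer, which is the kind of direct access that avoids a separate decompression pass.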

Journal: BMC Bioinformatics	Publication Date: May 16, 2012
Citations: 20	License type: CC BY 2.0
