Compression and fast retrieval of SNP data.

Francesco Sambo, Claudio Cobelli, Barbara Di Camillo, Gianna Toffolo

doi:10.1093/bioinformatics/btu495

Abstract

The increasing interest in rare genetic variants and epistatic genetic effects on complex phenotypic traits is currently pushing genome-wide association study design towards datasets of increasing size, both in the number of studied subjects and in the number of genotyped single nucleotide polymorphisms (SNPs). This, in turn, is leading to a compelling need for new methods for compression and fast retrieval of SNP data. We present a novel algorithm and file format for compressing and retrieving SNP data, specifically designed for large-scale association studies. Our algorithm is based on two main ideas: (i) compress linkage disequilibrium blocks in terms of differences with a reference SNP and (ii) compress reference SNPs exploiting information on their call rate and minor allele frequency. Tested on two SNP datasets and compared with several state-of-the-art software tools, our compression algorithm is shown to be competitive in terms of compression rate and to outperform all tools in terms of time to load compressed data. Our compression and decompression algorithms are implemented in a C++ library, are released under the GNU General Public License and are freely downloadable from http://www.dei.unipd.it/~sambofra/snpack.html.
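The two ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' file format or their C++ library: the genotype coding (0, 1, 2 for minor-allele count, 3 for a missing call), the function names and the list-of-pairs layout are all assumptions made for illustration only. The first pair of functions stores a SNP from a linkage disequilibrium block as its differences from a reference SNP; the second pair stores a reference SNP sparsely, which pays off when the call rate is high and the minor allele frequency is low.

```python
from typing import List, Tuple

# Assumed genotype coding (hypothetical, not taken from the paper):
# 0, 1, 2 = number of minor alleles; 3 = missing call.
MISSING = 3


def diff_encode(reference: List[int], snp: List[int]) -> List[Tuple[int, int]]:
    """Idea (i): keep only the subjects whose genotype differs from the reference SNP."""
    if len(reference) != len(snp):
        raise ValueError("both SNPs must cover the same subjects")
    return [(i, g) for i, (r, g) in enumerate(zip(reference, snp)) if r != g]


def diff_decode(reference: List[int], diffs: List[Tuple[int, int]]) -> List[int]:
    """Rebuild a SNP by patching a copy of the reference SNP."""
    snp = list(reference)
    for i, g in diffs:
        snp[i] = g
    return snp


def sparse_encode(snp: List[int], common: int = 0) -> Tuple[int, List[Tuple[int, int]]]:
    """Idea (ii), simplified: store only genotypes that differ from the most
    common one; few entries remain when MAF is low and call rate is high."""
    return len(snp), [(i, g) for i, g in enumerate(snp) if g != common]


def sparse_decode(n: int, entries: List[Tuple[int, int]], common: int = 0) -> List[int]:
    """Rebuild a reference SNP from its sparse representation."""
    snp = [common] * n
    for i, g in entries:
        snp[i] = g
    return snp


# Toy data: eight subjects, two SNPs in strong linkage disequilibrium.
reference = [0, 0, 1, 0, 2, 0, 0, MISSING]
neighbour = [0, 0, 1, 0, 2, 1, 0, MISSING]

diffs = diff_encode(reference, neighbour)      # one differing subject
n, entries = sparse_encode(reference)          # three non-common genotypes
```

In this toy case the neighbouring SNP needs a single (index, genotype) pair instead of eight genotypes, and the reference SNP needs three. A real format would pack these values into bits rather than Python tuples.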

Journal: Bioinformatics
Publication Date: Jul 26, 2014
Citations: 8

Similar Papers

The impact of single nucleotide polymorphism selection on prediction of genomewide breeding values
Kacper Żukowski ... Joanna Szyda
BMC Proceedings | VOL. 3
23 Feb 2009

SNP HiTLink: a high-throughput linkage analysis system employing dense SNP data
Yoko Fukuda ... Hidetoshi Date
BMC Bioinformatics | VOL. 10
24 Apr 2009

Genome-wide analysis of 10664 SARS-CoV-2 genomes to identify virus strains in 73 countries based on single nucleotide polymorphism
Nimisha Ghosh ... Dariusz Plewczynski
Virus Research | VOL. 298
26 Mar 2021

Genetic analysis of albuminuria in aging mice and concordance with loci for human diabetic nephropathy found in a genome-wide association scan
Shirng-Wern Tsaih ... Ron Korstanje
Kidney International | VOL. 77
01 Feb 2010
