CSHAP: efficient haplotype frequency estimation based on sparse representation.

Yinsheng Zhou,Han Zhang,Yaning Yang,Jonathan Wren

doi:10.1093/bioinformatics/bty1040

Abstract

Estimating haplotype frequencies from genotype data plays an important role in genetic analysis. In silico methods are usually computationally involved since phase information is not available. Due to tight linkage disequilibrium and low recombination rates, the number of haplotypes observed in human populations is far less than all the possibilities. This motivates us to solve the estimation problem by maximizing the sparsity of existing haplotypes. Here, we propose a new algorithm by applying the compressive sensing (CS) theory in the field of signal processing, compressive sensing haplotype inference (CSHAP), to solve the sparse representation of haplotype frequencies based on allele frequencies and between-allele co-variances. Our proposed approach can handle both individual genotype data and pooled DNA data with hundreds of loci. The CSHAP exhibits the same accuracy compared with the state-of-the-art methods, but runs several orders of magnitude faster. CSHAP can also handle with missing genotype data imputations efficiently. The CSHAP is implemented in R, the source code and the testing datasets are available at http://home.ustc.edu.cn/∼zhouys/CSHAP/. Supplementary data are available at Bioinformatics online.

Full Text