Fast imputation using medium or low-coverage sequence data.

Paul M VanRaden, Jeffrey R O’Connell, Chuanyu Sun

doi:10.1186/s12863-015-0243-7

Open Access

https://doi.org/10.1186/s12863-015-0243-7

Journal: BMC Genetics
Publication Date: Jul 14, 2015
Citations: 80
License type: CC BY 4.0

Abstract

Background: Accurate genotype imputation can greatly reduce costs and increase benefits by combining whole-genome sequence data of varying read depth and array genotypes of varying densities. For large populations, an efficient strategy chooses the two haplotypes most likely to form each genotype and updates posterior allele probabilities from prior probabilities within those two haplotypes as each individual’s sequence is processed. Directly using allele read counts can improve imputation accuracy and reduce computation compared with calling or computing genotype probabilities first and then imputing.

Results: A new algorithm was implemented in findhap (version 4) software and tested using simulated bovine and actual human sequence data with different combinations of reference population size, sequence read depth and error rate. Read depths of ≥8× may be desired for direct investigation of sequenced individuals, but for a given total cost, sequencing more individuals at read depths of 2× to 4× gave more accurate imputation from array genotypes. Imputation accuracy improved further if reference individuals had both low-coverage sequence and high-density (HD) microarray data, and remained high even with a read error rate of 16%. With read depths of ≤4×, findhap (version 4) had higher accuracy than Beagle (version 4); computing time was up to 400 times faster with findhap than with Beagle. For 10,000 sequenced individuals plus 250 with HD array genotypes to test imputation, findhap used 7 hours, 10 processors and 50 GB of memory for 1 million loci on one chromosome. Computing times increased in proportion to population size but less than proportionally to the number of variants.

Conclusions: Simultaneous genotype calling from low-coverage sequence data and imputation from array genotypes of various densities is done very efficiently within findhap by updating allele probabilities within the two haplotypes for each individual. Accuracy of genotype calling and imputation was high with both simulated bovine and actual human genomes reduced to low-coverage sequence and HD microarray data. More efficient imputation allows geneticists to locate and test effects of more DNA variants from more individuals and to include those in future prediction and selection.
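The preference for read depths of ≥8× when sequenced individuals are examined directly can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes every site is covered by exactly `depth` reads, that each read samples one of the two alleles with equal probability, and that there are no read errors. The function name `prob_both_alleles_seen` is hypothetical.

```python
def prob_both_alleles_seen(depth: int) -> float:
    """Chance that a heterozygous site shows both alleles among `depth` reads.

    Assumes each read independently samples either allele with probability 0.5
    and ignores read errors and variation in depth between sites.
    """
    if depth < 1:
        return 0.0
    # The site looks homozygous if all reads carry the same allele:
    # two possible alleles, each with probability 0.5 ** depth.
    return 1.0 - 2.0 * 0.5 ** depth


for depth in (2, 4, 8):
    print(depth, prob_both_alleles_seen(depth))
# prints:
# 2 0.5
# 4 0.875
# 8 0.9921875
```

Under these simplifying assumptions, half of all heterozygous sites show only one allele at 2×, so low-coverage genotypes cannot be called reliably from one individual's reads alone and depend on haplotype information shared across individuals.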

Highlights

Accurate genotype imputation can greatly reduce costs and increase benefits by combining whole-genome sequence data of varying read depth and array genotypes of varying densities
Sequence genotypes can be imputed accurately by combining information from individuals sequenced at lower coverage and from those genotyped with less expensive single nucleotide polymorphism (SNP) arrays
Computing costs are low for version 4 of findhap [31] and increase linearly with population size but less than proportionally to single nucleotide variant (SNV) density (Table 1)

Summary

Introduction

Accurate genotype imputation can greatly reduce costs and increase benefits by combining whole-genome sequence data of varying read depth and array genotypes of varying densities. Using allele read counts can improve imputation accuracy and reduce computation compared with calling or computing genotype probabilities first and then imputing. Genotype imputation greatly reduces cost and increases accuracy of estimating genetic effects by increasing the ratio of output to input data, but algorithms must be efficient because numbers of genotypes to impute may increase faster than computer resources. International exchange of sequence data has provided whole genomes for thousands of humans [1] and hundreds of bulls [2]. Array genotypes are available for hundreds of thousands of [...] Genome sequencing directly interrogates the genetic variation that underlies quantitative traits and disease susceptibility and enables a better understanding of biology.
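To make concrete what allele read counts contribute, here is a minimal sketch of a generic genotype-likelihood calculation. It is not the findhap algorithm, which updates allele probabilities within haplotypes. The sketch assumes a symmetric read error rate, independent reads, and Hardy-Weinberg priors from an alternate-allele frequency; the function names are hypothetical.

```python
from math import comb


def genotype_likelihoods(n_ref: int, n_alt: int, error_rate: float) -> list[float]:
    """Binomial likelihood of the read counts for genotypes with 0, 1, 2 alt alleles."""
    n = n_ref + n_alt
    likelihoods = []
    for g in (0, 1, 2):
        # Probability that a single read shows the alternate allele:
        # a true alt allele read correctly, or a true ref allele misread.
        p_alt = (g / 2) * (1 - error_rate) + (1 - g / 2) * error_rate
        likelihoods.append(comb(n, n_alt) * p_alt ** n_alt * (1 - p_alt) ** n_ref)
    return likelihoods


def genotype_posteriors(n_ref: int, n_alt: int, error_rate: float,
                        alt_freq: float) -> list[float]:
    """Posterior genotype probabilities under Hardy-Weinberg priors (an assumption)."""
    q = alt_freq
    priors = [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]
    likes = genotype_likelihoods(n_ref, n_alt, error_rate)
    weighted = [p * l for p, l in zip(priors, likes)]
    total = sum(weighted)
    return [w / total for w in weighted]


# One reference read and one alternate read: the heterozygote dominates.
print(genotype_posteriors(n_ref=1, n_alt=1, error_rate=0.01, alt_freq=0.3))
# Two reference reads only: a heterozygote remains plausible at this depth.
print(genotype_posteriors(n_ref=2, n_alt=0, error_rate=0.01, alt_freq=0.3))
```

With no reads at a site the posteriors equal the priors, which is the situation imputation has to resolve using information from other individuals.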
