Abstract

Background

We describe the latest improvements to the long-range phasing (LRP) and haplotype library imputation (HLI) algorithms for successful phasing of both datasets with one million individuals and datasets genotyped using different sets of single nucleotide polymorphisms (SNPs). Previous publicly available implementations of the LRP algorithm implemented in AlphaPhase could not phase large datasets due to the computational cost of defining surrogate parents by exhaustive all-against-all searches. Furthermore, the AlphaPhase implementations of LRP and HLI were not designed to deal with large amounts of missing data that are inherent when using multiple SNP arrays.

Methods

We developed methods that avoid the need for all-against-all searches by performing LRP on subsets of individuals and then concatenating the results. We also extended LRP and HLI algorithms to enable the use of different sets of markers, including missing values, when determining surrogate parents and identifying haplotypes. We implemented and tested these extensions in an updated version of AlphaPhase, and compared its performance to the software package Eagle2.

Results

A simulated dataset with one million individuals genotyped with the same 6711 SNPs for a single chromosome took less than a day to phase, compared to more than seven days for Eagle2. The percentage of correctly phased alleles at heterozygous loci was 90.2 and 99.9% for AlphaPhase and Eagle2, respectively. A larger dataset with one million individuals genotyped with 49,579 SNPs for a single chromosome took AlphaPhase 23 days to phase, with 89.9% of alleles at heterozygous loci phased correctly. The phasing accuracy was generally lower for datasets with different sets of markers than with one set of markers. For a simulated dataset with three sets of markers, 1.5% of alleles at heterozygous positions were phased incorrectly, compared to 0.4% with one set of markers.

Conclusions

The improved LRP and HLI algorithms enable AlphaPhase to quickly and accurately phase very large and heterogeneous datasets. AlphaPhase is an order of magnitude faster than the other tested packages, although Eagle2 showed a higher level of phasing accuracy. The speed gain will make phasing achievable for very large genomic datasets in livestock, enabling more powerful breeding and genetics research and application.
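The accuracy figures above are percentages of alleles at heterozygous loci. As a rough illustration of how such a summary could be computed for one individual, the sketch below classifies each allele at a heterozygous locus as correct, unphased, or incorrect. It is not taken from AlphaPhase: the function name `phasing_summary`, the missing-allele code `9`, and the assumption that the called haplotypes are already in the same (paternal, maternal) order as the true haplotypes are all choices made for this example.

```python
MISSING = 9  # assumed code for an allele left unphased (illustrative choice)


def phasing_summary(true_hap1, true_hap2, called_hap1, called_hap2):
    """Percentages of correct, unphased and incorrect alleles at heterozygous loci.

    Haplotypes are equal-length sequences of 0/1 alleles. The called haplotypes
    are assumed to be in the same order as the true ones, so no haplotype
    swapping is considered here.
    """
    counts = {"correct": 0, "unphased": 0, "incorrect": 0}
    for t1, t2, c1, c2 in zip(true_hap1, true_hap2, called_hap1, called_hap2):
        if t1 == t2:
            continue  # homozygous locus: carries no phase information
        for truth, call in ((t1, c1), (t2, c2)):
            if call == MISSING:
                counts["unphased"] += 1
            elif call == truth:
                counts["correct"] += 1
            else:
                counts["incorrect"] += 1
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: 100.0 * value / total for key, value in counts.items()}


# Toy example: five loci, of which four are heterozygous in the true haplotypes.
summary = phasing_summary(
    true_hap1=[0, 1, 0, 1, 1],
    true_hap2=[1, 1, 1, 0, 0],
    called_hap1=[0, 1, 9, 0, 1],
    called_hap2=[1, 1, 9, 1, 0],
)
```

In this toy case two heterozygous loci are phased correctly, one is left unphased, and one is phased the wrong way round, so half of the alleles count as correct and a quarter each as unphased and incorrect.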

Highlights

  • In this paper, we introduced improvements to the LRP and HLI algorithms of AlphaPhase [2] to enable phasing of very large and heterogeneous datasets in which individuals may have been genotyped on different sets of SNPs

  • We further investigated the impact of this parameter with the modifications made to the LRP and HLI algorithms and the use of much denser SNP arrays

Summary

Introduction

Phasing genotypes is the process of inferring the parental origin of an individual’s alleles. This process resolves the inheritance of chromosome segments in a population and is, as such, a cornerstone technique in genetics. The size of genomic datasets has grown rapidly in recent years, with genotype data from single nucleotide polymorphism (SNP) arrays being collected on increasing numbers of individuals. In agriculture, this growth has been driven by the increased use of genomic selection [4,5,6], whereas in human genetics it has been driven by the increased power of genome-wide association studies [7,8,9] and of genomic prediction in human medicine [10]. Examples of such large datasets include the UK Biobank [11], which has recently released SNP genotype data on approximately half a million people [12], and the US Dairy Cattle and Irish Cattle Breeding Federation Databases, which each host genotypes on well over a million animals [6, 13, 14].
