Abstract

Background: The increasing availability of whole-genome sequence data is expected to increase the accuracy of genomic prediction. However, results from simulation studies and analyses of real data do not always show an increase in accuracy from sequence data compared to high-density (HD) single nucleotide polymorphism (SNP) chip genotypes. In addition, the sheer number of variants makes analysing all variants and accurately estimating all effects computationally challenging. Our objective was to find a strategy to approximate the analysis of whole-sequence data with a Bayesian variable selection model. Using a simulated dataset, we applied a Bayes R hybrid model to analyse whole-sequence data, tested the effect of dropping a proportion of variants during the analysis, and tested how the analysis can be split into separate analyses per chromosome to reduce the elapsed computing time. We also investigated the effect of imputation errors on prediction accuracy. Subsequently, we applied the approach to a dataset that contained imputed sequences and records for production and fertility traits for 38,492 Holstein, Jersey, Australian Red and crossbred bulls and cows.

Results: With the simulated dataset, sequence data substantially increased prediction accuracy, compared to HD SNP data, for a breed that was not represented in the training population. Either dropping part of the variants during the analysis or splitting the analysis into separate analyses per chromosome decreased accuracy compared to analysing whole-sequence data. However, first dropping variants from each chromosome and then reanalysing the retained variants together resulted in an accuracy similar to that obtained when analysing whole-sequence data. Adding imputation errors decreased prediction accuracy, especially for errors in the validation population. With real data, using sequence variants resulted in accuracies that were similar to those obtained with the HD SNPs.

Conclusions: We present an efficient approach to approximate analysis of whole-sequence data with a Bayesian variable selection model. The lack of increase in prediction accuracy when applied to real data could be due to imputation errors, which demonstrates the importance of developing more accurate methods of imputation or directly genotyping sequence variants that have a major effect in the prediction equation.

Highlights

  • The increasing availability of whole-genome sequence data is expected to increase the accuracy of genomic prediction

  • Dropping 70% or 90% of the variants after 10,000 Markov chain Monte Carlo (MCMC) iterations resulted in accuracies that were similar or slightly reduced compared to those with S_FULL_D0

  • We focus the discussion on two points, i.e. (1) on the ability to reduce the computing time needed for analysis of whole-genome sequence data by using an EM-MCMC hybrid approach, dropping some variants from the analysis and processing chromosomes in parallel, and (2) on the reasons why genome sequence data may or may not result in higher accuracies than HD single nucleotide polymorphism (SNP) genotypes
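The data flow behind the "drop variants per chromosome, then reanalyse the retained variants together" strategy can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the per-variant score, and the rule of keeping at least one variant per chromosome are all assumptions made for the example. The text above does not state which criterion is used to rank variants, so the score is a hypothetical placeholder for whatever per-variant importance measure the per-chromosome analysis produces.

```python
from typing import Dict, List, Tuple

# (variant_id, score); the score is a HYPOTHETICAL importance measure,
# since the source does not specify how variants are ranked for dropping.
ScoredVariant = Tuple[str, float]


def retain_top_variants(scores: List[ScoredVariant], drop_fraction: float) -> List[str]:
    """Drop the lowest-scoring fraction of variants and return the kept IDs."""
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    if not scores:
        return []
    n_drop = round(len(scores) * drop_fraction)
    # Assumption: always keep at least one variant per chromosome.
    n_keep = max(1, len(scores) - n_drop)
    ranked = sorted(scores, key=lambda item: item[1], reverse=True)
    return [variant_id for variant_id, _ in ranked[:n_keep]]


def pool_retained(per_chrom_scores: Dict[str, List[ScoredVariant]],
                  drop_fraction: float) -> List[str]:
    """Apply the per-chromosome filter, then pool survivors for a joint reanalysis."""
    pooled: List[str] = []
    for chrom in sorted(per_chrom_scores):
        # Each chromosome is filtered independently, so this loop could run in parallel.
        pooled.extend(retain_top_variants(per_chrom_scores[chrom], drop_fraction))
    return pooled
```

Calling `pool_retained(scores_by_chromosome, 0.7)` would return the variant IDs to pass to a single joint analysis. The filtering step touches one chromosome at a time, which is what allows the per-chromosome analyses to be run in parallel before the pooled reanalysis.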


Summary

Introduction

The increasing availability of whole-sequence data, which should contain causative mutations for complex traits, is expected to increase the accuracy of genomic prediction and to aid in the identification of these causative mutations. However, results from both simulation studies and analyses of real data do not always show an increase in accuracy from sequence data compared to SNP chip genotypes. Studies using whole-sequence data in dairy cattle [4] and chicken [5] showed no or very little increase in prediction accuracy compared to high-density SNP data, using either genomic best linear unbiased prediction (GBLUP) or a Bayesian variable selection model. In dairy cattle [8, 9] and Drosophila [10], substantial increases in accuracy were obtained when several tens, hundreds or thousands of variants were selected based on a genome-wide association study (GWAS) and used for prediction in addition to genome-wide SNPs.

