Abstract

Background: For genomic prediction and genome-wide association studies (GWAS) using mixed models, covariance between individuals is estimated using molecular markers. Based on the properties of mixed models, using available molecular data for prediction is optimal if this covariance is known. Under this assumption, adding individuals to the analysis should never be detrimental. However, some empirical studies showed that increasing training population size decreased prediction accuracy. Recently, results from theoretical models indicated that even if marker density is high and traits are controlled by many loci with small additive effects, the covariance between individuals, which depends on relationships at causal loci, is not always well estimated by the whole-genome kinship.

Results: We propose an alternative covariance estimator named K-kernel, to account for potential genetic heterogeneity between populations that is characterized by a lack of genetic correlation, and to limit the information flow between a priori unknown populations in a trait-specific manner. The approach is similar to a multi-trait model, with parameters estimated by REML; in extreme cases, it can allow for an independent genetic architecture between populations. As such, K-kernel is useful to study the problem of the design of training populations. K-kernel was compared to other covariance estimators or kernels to examine its fit to the data, cross-validated accuracy and suitability for GWAS on several datasets. It provides a significantly better fit to the data than the genomic best linear unbiased prediction (GBLUP) model and, in some cases, it performs better than other kernels such as the Gaussian kernel, as shown by an empirical null distribution. In GWAS simulations, alternative kernels control type I errors as well as or better than the classical whole-genome kinship and increase statistical power. No or small gains were observed in cross-validated prediction accuracy.

Conclusions: This alternative covariance estimator can be used to gain insight into trait-specific genetic heterogeneity by identifying relevant sub-populations that lack genetic correlation between them. Because the genetic correlation between identified sub-populations can be 0, the method can perform an automatic selection of the relevant sets of individuals to include in the training population. It may also increase statistical power in GWAS.

Electronic supplementary material: The online version of this article (doi:10.1186/s12711-015-0171-z) contains supplementary material, which is available to authorized users.
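The idea of limiting information flow between populations can be illustrated with a small sketch. It is not the authors' K-kernel: here the group labels are supplied by hand (in the paper the populations are a priori unknown), the between-group correlation `rho` is fixed rather than estimated by REML, and the helper name `group_limited_covariance` is hypothetical. The sketch only shows how scaling the between-group entries of a kinship matrix moves from a fully pooled covariance (`rho = 1`) to independent genetic architectures (`rho = 0`).

```python
import numpy as np

def group_limited_covariance(kinship, groups, rho):
    """Scale between-group entries of a kinship matrix by rho.

    Hypothetical helper for illustration only: within-group covariances
    are kept as they are, between-group covariances are multiplied by rho.
    """
    groups = np.asarray(groups)
    same_group = groups[:, None] == groups[None, :]
    return kinship * np.where(same_group, 1.0, rho)

# Toy data (assumption): 6 individuals, 50 markers coded 0/1/2.
rng = np.random.default_rng(0)
markers = rng.integers(0, 3, size=(6, 50)).astype(float)
centered = markers - markers.mean(axis=0)
kinship = centered @ centered.T / markers.shape[1]

# Group labels are given here; the paper treats them as unknown.
groups = [0, 0, 0, 1, 1, 1]

independent = group_limited_covariance(kinship, groups, rho=0.0)  # block-diagonal
partial = group_limited_covariance(kinship, groups, rho=0.5)      # reduced sharing
pooled = group_limited_covariance(kinship, groups, rho=1.0)       # plain kinship
```

With `rho` between 0 and 1 the scaling matrix is positive semi-definite, so its element-wise product with a positive semi-definite kinship matrix remains a valid covariance matrix.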

Highlights

  • For genomic prediction and genome-wide association studies (GWAS) using mixed models, covariance between individuals is estimated using molecular markers

  • Many genomic prediction studies showed that the prediction accuracy of the GBLUP model decreases as more individuals are added to the training population. This problem has received considerable attention in the context of prediction between breeds and, so far, empirical results obtained with the GBLUP model have been disappointing [Heslot and Jannink, Genet Sel Evol (2015) 47:93]

  • Hayes et al [3] showed that the expected accuracies that were derived from the mixed model matched the within-breed observed accuracies but not the between-breed observed accuracies, and poor predictive ability was observed from one breed to the other

Summary

Introduction

For genomic prediction and genome-wide association studies (GWAS) using mixed models, covariance between individuals is estimated using molecular markers. Based on the properties of mixed models, using available molecular data for prediction is optimal if this covariance is known. Under this assumption, adding individuals to the analysis should never be detrimental. However, some empirical studies showed that increasing training population size decreased prediction accuracy. Dawson et al [8] used historical data from international nurseries that were collected between 1992 and 2009, and reported inconsistent accuracies when they used data from previous years to predict accuracies of later years. These prediction accuracies were not explained by variation in the quality of the phenotype data of the training or validation sets. Rutkoski et al [9] showed that accuracies were lower with a training population of 365 individuals than with optimized subsets of that population that were less than half its size.
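As a rough illustration of how a marker-based covariance enters prediction, the sketch below builds a whole-genome kinship matrix and uses it for BLUP-style prediction. Several choices are assumptions and not taken from the text: the kinship follows the commonly used VanRaden form, the ratio of residual to genetic variance (`noise_ratio`) is treated as known instead of being estimated by REML, the mean is estimated by a simple average, and the data are simulated. The function names are hypothetical.

```python
import numpy as np

def vanraden_kinship(markers):
    """Whole-genome kinship from an (n, m) matrix of 0/1/2 genotypes.

    Assumption: VanRaden-type estimator with allele frequencies
    estimated from the genotyped sample itself.
    """
    freqs = markers.mean(axis=0) / 2.0
    centered = markers - 2.0 * freqs
    return centered @ centered.T / (2.0 * np.sum(freqs * (1.0 - freqs)))

def gblup_predict(kinship, y_train, train_idx, test_idx, noise_ratio):
    """Predict test phenotypes from training phenotypes and kinship.

    noise_ratio is the residual-to-genetic variance ratio, assumed known.
    The mean is a simple average, not a generalized least squares estimate.
    """
    k_train = kinship[np.ix_(train_idx, train_idx)]
    k_cross = kinship[np.ix_(test_idx, train_idx)]
    mu = y_train.mean()
    weights = np.linalg.solve(
        k_train + noise_ratio * np.eye(len(train_idx)), y_train - mu
    )
    return mu + k_cross @ weights

# Simulated example (assumption): 40 individuals, 200 markers,
# additive marker effects plus independent noise.
rng = np.random.default_rng(1)
markers = rng.integers(0, 3, size=(40, 200)).astype(float)
effects = rng.normal(0.0, 0.1, size=200)
phenotypes = markers @ effects + rng.normal(0.0, 1.0, size=40)

kinship = vanraden_kinship(markers)
train_idx = np.arange(30)
test_idx = np.arange(30, 40)
predictions = gblup_predict(
    kinship, phenotypes[train_idx], train_idx, test_idx, noise_ratio=1.0
)
```

The prediction is optimal only if `kinship` reflects the true covariance at causal loci; when it does not, as discussed above, adding individuals to `train_idx` is no longer guaranteed to help.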

