On the genomic analysis of data from structured populations.

G de los Campos, D Sorensen

doi:10.1111/jbg.12091

Abstract

The availability of dense SNP panels allows assessing similarity between distantly related individuals, including those with different genetic backgrounds. This has renewed interest in the analysis of data from heterogeneous populations. Examples include the genomic analysis of data from multiple breeds (Hayes et al. 2009, Gen. Sel. Evol. 41, 51), or from structured populations (e.g. de los Campos et al. 2009, Genetics 182, 375–385; Daetwyler et al. 2012, J. Anim. Sci. 90, 3375–3384). Whole genome regression (WGR) methods (Meuwissen, Hayes, and Goddard, 2001, Genetics 157, 1819–1829), where allele substitution effects are assumed to be homogeneous across subjects, have been used for the analysis of data from structured populations (e.g. Hayes et al. 2009, Gen. Sel. Evol. 41, 51; de los Campos et al. 2009, Genetics 182, 375–385). This approach allows borrowing information across groups and, under some circumstances, can increase prediction accuracy. For example, in a combined analysis of Holstein and Jersey data, Hayes et al. (2009, Gen. Sel. Evol. 41, 51) showed that prediction accuracy of estimated breeding values could be increased, relative to a within-breed analysis, in Jerseys but not in Holsteins. However, assuming that marker effects are constant across groups ignores the fact that dominance, epistasis or differences in marker–QTL linkage disequilibrium (LD) patterns can lead to group-specific marker effects. Principal component (PC) methods are commonly used in genome-wide association studies to account for population structure (e.g. Price et al. 2006, Nat. Genet. 38, 34–41; Marchini et al. 2004, Nat. Genet. 36, 512–517). Drawing on these ideas, some authors have suggested expanding WGRs such as the G-BLUP (genomic best linear unbiased predictor) by adding marker-derived PCs as fixed-effect covariates. This approach has been used to account for stratification in the estimation of variance components (e.g. Yang et al., 2010, Nat. Genet. 42, 565–569) and in the prediction of breeding values (e.g. Daetwyler et al., 2012, J. Anim. Sci. 90, 3375–3384). However, Janss et al. (2012, Genetics 192, 693–704) demonstrated that adding eigenvectors as fixed effects in G-BLUP can create important inferential problems. Indeed, Gaussian processes, including the G-BLUP, are equivalent to a random regression on all marker-derived PCs (e.g. de los Campos et al., 2010, Genetics Research 92, 295–308). Therefore, the PCs that are added as fixed effects in the G-BLUP, typically those with the largest eigenvalues, enter the model twice, and this can have adverse effects on inferences about variance components. The problem is aggravated by the fact that in G-BLUP, despite their random nature, the effects of eigenvectors with large eigenvalues are shrunk very little and are therefore effectively estimated as fixed effects. In their article, Janss et al. showed how the standard G-BLUP, parameterized using PCs, can be used to draw inferences and predictions based on all or some PCs in a coherent statistical framework. This approach should be preferred over the one using PCs as fixed effects in a G-BLUP model. However, regardless of how PCs are dealt with, when only a subset of PCs is used for inferences, the connection with the original model is lost and the parameters have no genetic interpretation.
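
The equivalence between G-BLUP and a random regression on marker-derived PCs can be checked numerically. The following is a minimal sketch, not the authors' implementation. It rests on several assumptions that are not part of the original text: simulated genotypes from two hypothetical subpopulations, a genomic relationship matrix computed as G = XX'/p from centered marker codes (one common choice among several), a purely simulated phenotype, and a residual-to-genetic variance ratio treated as known. Under those assumptions, the G-BLUP predictor G(G + ratio·I)⁻¹y should coincide with a regression on the eigenvectors of G in which the coefficient of PC j is shrunk by λ_j / (λ_j + ratio).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical structured population: two groups with different allele frequencies.
n_per_group, p = 25, 200
X = np.vstack([
    rng.binomial(2, 0.2, size=(n_per_group, p)),
    rng.binomial(2, 0.6, size=(n_per_group, p)),
]).astype(float)
X -= X.mean(axis=0)                  # center marker codes
n = X.shape[0]

G = X @ X.T / p                      # assumed form of the genomic relationship matrix
y = rng.normal(size=n)               # simulated phenotype, illustration only
y -= y.mean()
ratio = 1.0                          # assumed residual-to-genetic variance ratio

# Standard G-BLUP: u_hat = G (G + ratio * I)^{-1} y
u_gblup = G @ np.linalg.solve(G + ratio * np.eye(n), y)

# PC parameterization: G = U diag(lam) U', eigenvalues in ascending order
lam, U = np.linalg.eigh(G)
shrink = lam / (lam + ratio)         # per-PC shrinkage factor (1 would mean no shrinkage)
u_pc = U @ (shrink * (U.T @ y))

max_abs_diff = np.max(np.abs(u_gblup - u_pc))
print("max |G-BLUP - PC regression|:", max_abs_diff)
print("shrinkage, smallest vs largest eigenvalue:", shrink[0], shrink[-1])
```

In this simulation the leading eigenvector mostly captures the difference between the two groups, and its shrinkage factor is the closest to 1. This is the sense in which PCs with large eigenvalues are effectively treated as fixed effects, and why adding them again as fixed-effect covariates makes them enter the model twice.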
