Abstract

Principal Component Analysis (PCA) and the Linear Mixed-effects Model (LMM), sometimes in combination, are the most common genetic association models. Previous PCA-LMM comparisons give mixed results and unclear guidance, and have several limitations, including not varying the number of principal components (PCs), simulating simple population structures, and inconsistent use of real data and power evaluations. We evaluate PCA and LMM, both with varying numbers of PCs, in realistic genotype and complex trait simulations including admixed families, subpopulation trees, and real multiethnic human datasets with simulated traits. We find that LMM without PCs usually performs best, with the largest effects in family simulations and real human datasets and traits without environment effects. Poor PCA performance on human datasets is driven by large numbers of distant relatives more than the smaller number of closer relatives. While PCA was known to fail on family data, we report strong effects of family relatedness in genetically diverse human datasets, not avoided by pruning close relatives. Environment effects driven by geography and ethnicity are better modeled with LMM including those labels instead of PCs. This work better characterizes the severe limitations of PCA compared to LMM in modeling the complex relatedness structures of multiethnic human data for association studies.

Editor's evaluation

This is an important paper that presents compelling arguments (based on simulation and comprehensively reviewed background theory) that Linear Mixed Models generally should perform better at correcting for genetic and environmental confounding in GWAS than more commonly used Principal Components methods.
https://doi.org/10.7554/eLife.79238.sa0

Introduction

The goal of a genetic association study is to identify loci whose genotype variation is significantly correlated with a given trait. Naive association tests assume that genotypes are drawn independently from a common allele frequency. This assumption does not hold for structured populations, which include multiethnic cohorts and admixed individuals (ancient relatedness), and for family data (recent relatedness; Astle and Balding, 2009). Association studies of admixed and multiethnic cohorts, the focus of this work, are becoming more common, are believed to be more powerful, and are necessary to bring more equity to genetic medicine (Rosenberg et al., 2010; Hoffman and Dubé, 2013; Coram et al., 2013; Medina-Gomez et al., 2015; Conomos et al., 2016a; Hodonsky et al., 2017; Martin et al., 2017a; Martin et al., 2017b; Hindorff et al., 2018; Hoffmann et al., 2018; Mogil et al., 2018; Roselli et al., 2018; Wojcik et al., 2019; Peterson et al., 2019; Zhong et al., 2019; Hu et al., 2020; Simonin-Wilmer et al., 2021; Kamariza et al., 2021; Lin et al., 2021; Mahajan et al., 2022; Hou et al., 2023a). When inadequate approaches are applied to data with relatedness, their association statistics are miscalibrated, resulting in excess false positives and loss of power (Devlin and Roeder, 1999; Voight and Pritchard, 2005; Astle and Balding, 2009). Therefore, many specialized approaches have been developed for genetic association under relatedness, of which PCA and LMM are the most popular. Genetic association with PCA consists of including the top eigenvectors of the population kinship matrix as covariates in a generalized linear model (Zhang et al., 2003; Price et al., 2006; Bouaziz et al., 2011).
These top eigenvectors are a new set of coordinates for individuals that are commonly referred to as PCs in genetics (Patterson et al., 2006), the convention adopted here, but in other fields PCs instead denote what in genetics would be the projections of loci onto eigenvectors, which are new independent coordinates for loci (Jolliffe, 2002). The direct ancestor of PCA association is structured association, in which inferred ancestry (genetic cluster membership, often corresponding with labels such as “European”, “African”, “Asian”, etc.) or admixture proportions of these ancestries are used as regression covariates (Pritchard et al., 2000). These models are deeply connected because PCs map to ancestry empirically (Alexander et al., 2009; Zhou et al., 2016) and theoretically (McVean, 2009; Zheng and Weir, 2016; Cabreros and Storey, 2019; Chiu et al., 2022), and they work as well as global ancestry in association studies but are estimated more easily (Patterson et al., 2006; Zhao et al., 2007; Alexander et al., 2009; Bouaziz et al., 2011). Another approach closely related to PCA is nonmetric multidimensional scaling (Zhu and Yu, 2009). PCs are also proposed for modeling environment effects that are correlated to ancestry, for example, through geography (Novembre et al., 2008; Zhang and Pan, 2015; Lin et al., 2021). The strength of PCA is its simplicity, which as covariates can be readily included in more complex models, such as haplotype association (Xu and Guan, 2014) and polygenic models (Qian et al., 2020). However, PCA assumes that the underlying relatedness space is low dimensional (or low rank), so it can be well modeled with a small number of PCs, which may limit its applicability. PCA is known to be inadequate for family data (Patterson et al., 2006; Zhu and Yu, 2009; Thornton and McPeek, 2010; Price et al., 2010), which is called ‘cryptic relatedness’ when it is unknown to the researchers, but no other troublesome cases have been confidently identified. 
Recent work has focused on developing more scalable versions of the PCA algorithm (Lee et al., 2012; Abraham and Inouye, 2014; Galinsky et al., 2016; Abraham et al., 2017; Agrawal et al., 2020). PCA remains a popular and powerful approach for association studies. The other dominant association model under relatedness is the LMM, which includes a random effect parameterized by the kinship matrix. Unlike PCA, LMM does not assume that relatedness is low-dimensional, and explicitly models families via the kinship matrix. Early LMMs used kinship matrices estimated from known pedigrees or using methods that captured recent relatedness only, and modeled population structure (ancestry) as fixed effects (Yu et al., 2006; Zhao et al., 2007; Zhu and Yu, 2009). Modern LMMs estimate kinship from genotypes using a non-parametric estimator, often referred to as a genetic relationship matrix, that captures the combined covariance due to family relatedness and ancestry (Kang et al., 2008; Astle and Balding, 2009; Ochoa and Storey, 2021). Like PCA, LMM has also been proposed for modeling environment correlated to genetics (Vilhjálmsson and Nordborg, 2013; Wang et al., 2022). The classic LMM assumes a quantitative (continuous) complex trait, the focus of our work. Although case-control (binary) traits and their underlying ascertainment are theoretically a challenge (Yang et al., 2014), LMMs have been applied successfully to balanced case-control studies (Astle and Balding, 2009; Kang et al., 2010) and simulations (Price et al., 2010; Wu et al., 2011; Sul and Eskin, 2013), and have been adapted for unbalanced case-control studies (Zhou et al., 2018). 
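For orientation, the two association models introduced above are often written as follows. This is a sketch in generic notation rather than the notation of this article's Methods, which are not part of this excerpt: the symbols (trait vector y, genotype vector x_i of the tested locus, matrix U_r of the top r eigenvectors, kinship matrix Φ, variance components σ² and σ_s²) and the factor of 2 in the random-effect covariance are assumptions introduced here.

```latex
% PCA association: top r eigenvectors (PCs) of the kinship estimate as fixed covariates
\mathbf{y} = \mathbf{1}\alpha + \mathbf{x}_i \beta_i + \mathbf{U}_r \boldsymbol{\gamma}_r + \boldsymbol{\epsilon},
\qquad \boldsymbol{\epsilon} \sim \mathrm{Normal}(\mathbf{0}, \sigma^2 \mathbf{I})

% LMM association: random effect s with covariance parameterized by the kinship matrix
\mathbf{y} = \mathbf{1}\alpha + \mathbf{x}_i \beta_i + \mathbf{s} + \boldsymbol{\epsilon},
\qquad \mathbf{s} \sim \mathrm{Normal}(\mathbf{0}, 2\sigma_s^2 \boldsymbol{\Phi})
```

In both cases the null hypothesis for locus i is β_i = 0. The PCA model can only absorb relatedness that lies in the span of r eigenvectors, whereas the LMM uses the full kinship matrix, which is the low-dimensionality contrast drawn in the text.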
However, LMMs tend to be considerably slower than PCA and other models, so much effort has focused on improving their runtime and scalability (Aulchenko et al., 2007; Kang et al., 2008; Kang et al., 2010; Zhang et al., 2010; Lippert et al., 2011; Yang et al., 2011; Listgarten et al., 2012; Zhou and Stephens, 2012; Svishcheva et al., 2012; Loh et al., 2015; Zhou et al., 2018). An LMM variant that incorporates PCs as fixed covariates is tested thoroughly in our work. Since PCs are the top eigenvectors of the same kinship matrix estimate used in modern LMMs (Astle and Balding, 2009; Janss et al., 2012; Hoffman and Dubé, 2013; Zhang and Pan, 2015), then population structure is modeled twice in an LMM with PCs. However, some previous work has found the apparent redundancy of an LMM with PCs beneficial (Price et al., 2010; Tucker et al., 2014; Zhang and Pan, 2015), while others did not (Liu et al., 2011; Janss et al., 2012), and the approach continues to be used (Zeng et al., 2018; Mbatchou et al., 2021), although not always (Matoba et al., 2020). Recall that early LMMs used kinship to model family relatedness only, so population structure had to be modeled separately in those models, in practice as admixture fractions instead of PCs (Yu et al., 2006; Zhao et al., 2007; Zhu and Yu, 2009). The LMM with PCs (vs no PCs) is also believed to help better model loci that have experienced selection (Price et al., 2010; Vilhjálmsson and Nordborg, 2013) and environment effects correlated with genetics (Zhang and Pan, 2015). LMM and PCA are closely related models (Astle and Balding, 2009; Janss et al., 2012; Hoffman and Dubé, 2013; Zhang and Pan, 2015), so similar performance is expected particularly under low-dimensional relatedness. Direct comparisons have yielded mixed results, with several studies finding superior performance for LMM, notably from papers promoting advances in LMMs, while many others report comparable performance (Table 1). 
No papers find that PCA outperforms LMM decisively, although PCA occasionally performs better in isolated and artificial cases or individual measures, often with unknown significance. Previous studies generally used either only simulated or only real genotypes, with only two studies using both. The simulated genotype studies, which tended to have low model dimensions and FST, were more likely to report ties or mixed results (6/8), whereas real genotypes tended to clearly favor LMMs (9/11). Similarly, 10/12 papers with quantitative traits favor LMMs, whereas 6/9 papers with case-control traits gave ties or mixed results; case-control traits are the only factor we do not explore in this work. Additionally, although all previous evaluations measured type I error (or proxies such as genomic inflation factors (Devlin and Roeder, 1999) or QQ plots), a large fraction (6/17) did not measure power (or proxies such as ROC curves), and only four used more than one number of PCs for PCA. Lastly, no consensus has emerged as to why LMM might outperform PCA or vice versa (Price et al., 2010; Sul and Eskin, 2013; Price et al., 2013; Hoffman and Dubé, 2013), or which features of the real datasets are critical for the LMM advantage other than family relatedness, resulting in unclear guidance for using PCA. Hence, our work includes real and simulated genotypes with higher model dimensions and FST matching that of multiethnic human cohorts (Ochoa and Storey, 2021; Ochoa and Storey, 2019), varies the number of PCs, and measures robust proxies for type I error control and calibrated power.

Table 1. Previous PCA-LMM evaluations in the literature.

| Publication | Sim. genotypes: Type* | Sim. genotypes: K† | Sim. genotypes: FST‡ | Real§ | Trait¶ | Power | PCs (r) | Best |
|---|---|---|---|---|---|---|---|---|
| Zhao et al., 2007 | | | | ✓ | Q | ✓ | 8 | LMM |
| Zhu and Yu, 2009 | I, A, F | 3, 8 | ≤0.15 | ✓ | Q | ✓ | 1–22 | LMM |
| Astle and Balding, 2009 | I | 3 | 0.10 | | CC | ✓ | 10 | Tie |
| Kang et al., 2010 | | | | ✓ | Both | | 2–100 | LMM |
| Price et al., 2010 | I, F | 2 | 0.01 | | CC | | 1 | Mixed |
| Wu et al., 2011 | I, A | 2–4 | 0.01 | | CC | ✓ | 10 | Mixed |
| Liu et al., 2011 | S, A | 2–3 | R | | Q | ✓ | 10 | Tie |
| Sul and Eskin, 2013 | I | 2 | 0.01 | | CC | | 1 | Tie |
| Tucker et al., 2014 | I | 2 | 0.05 | ✓ | Both | ✓ | 5 | Tie |
| Yang et al., 2014 | | | | ✓ | CC | ✓ | 5 | Tie |
| Song et al., 2015 | S, A | 2–3 | R | | Q | | 3 | LMM |
| Loh et al., 2015 | | | | ✓ | Q | ✓ | 10 | LMM |
| Zhang and Pan, 2015 | | | | ✓ | Q | ✓ | 20–100 | LMM |
| Liu et al., 2016 | | | | ✓ | Q | ✓ | 3–6 | LMM |
| Sul et al., 2018 | | | | ✓ | Q | | 100 | LMM |
| Loh et al., 2018 | | | | ✓ | Both | ✓ | 20 | LMM |
| Mbatchou et al., 2021 | | | | ✓ | Both | | 1 | LMM |
| This work | A, T, F | 10–243 | ≤0.25 | ✓ | Q | ✓ | 0–90 | LMM |

* Genotype simulation types. I: Independent subpopulations; S: subpopulations (with parameters drawn from real data); A: Admixture; T: Subpopulation Tree; F: Family.
† Model dimension (number of subpopulations or ancestries).
‡ R: simulated parameters based on real data, FST not reported.
§ Evaluations using unmodified real genotypes.
¶ Q: quantitative; CC: case-control.

In this work, we evaluate the PCA and LMM association models under various numbers of PCs, which are included in LMMs too. We use genotype simulations (admixture, family, and subpopulation tree models) and three real datasets: the 1000 Genomes Project (Abecasis et al., 2010; Abecasis et al., 2012), the Human Genome Diversity Panel (HGDP) (Cann et al., 2002; Rosenberg et al., 2002; Bergström et al., 2020), and Human Origins (Patterson et al., 2012; Lazaridis et al., 2014; Lazaridis et al., 2016; Skoglund et al., 2016).
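The fractions quoted in the text before Table 1 (6/8, 9/11, 10/12, 6/9, 6/17, and "only four") can be recomputed from the table rows. The sketch below encodes the 17 previous studies as read from Table 1; the tuple layout and the helper name `frac` are illustrative choices made here, not part of the original work.

```python
# Each row: (publication, simulated genotypes?, real genotypes?, trait, power measured?, PCs, best)
ROWS = [
    ("Zhao et al., 2007", False, True, "Q", True, "8", "LMM"),
    ("Zhu and Yu, 2009", True, True, "Q", True, "1-22", "LMM"),
    ("Astle and Balding, 2009", True, False, "CC", True, "10", "Tie"),
    ("Kang et al., 2010", False, True, "Both", False, "2-100", "LMM"),
    ("Price et al., 2010", True, False, "CC", False, "1", "Mixed"),
    ("Wu et al., 2011", True, False, "CC", True, "10", "Mixed"),
    ("Liu et al., 2011", True, False, "Q", True, "10", "Tie"),
    ("Sul and Eskin, 2013", True, False, "CC", False, "1", "Tie"),
    ("Tucker et al., 2014", True, True, "Both", True, "5", "Tie"),
    ("Yang et al., 2014", False, True, "CC", True, "5", "Tie"),
    ("Song et al., 2015", True, False, "Q", False, "3", "LMM"),
    ("Loh et al., 2015", False, True, "Q", True, "10", "LMM"),
    ("Zhang and Pan, 2015", False, True, "Q", True, "20-100", "LMM"),
    ("Liu et al., 2016", False, True, "Q", True, "3-6", "LMM"),
    ("Sul et al., 2018", False, True, "Q", False, "100", "LMM"),
    ("Loh et al., 2018", False, True, "Both", True, "20", "LMM"),
    ("Mbatchou et al., 2021", False, True, "Both", False, "1", "LMM"),
]


def frac(rows, keep, hit):
    """Return (number of kept rows satisfying `hit`, number of kept rows)."""
    kept = [row for row in rows if keep(row)]
    return sum(1 for row in kept if hit(row)), len(kept)


# Simulated-genotype studies reporting ties or mixed results.
sim_tie_or_mixed = frac(ROWS, lambda r: r[1], lambda r: r[6] != "LMM")
# Real-genotype studies favoring LMM.
real_lmm = frac(ROWS, lambda r: r[2], lambda r: r[6] == "LMM")
# Studies with quantitative traits (Q or Both) favoring LMM.
quant_lmm = frac(ROWS, lambda r: r[3] in ("Q", "Both"), lambda r: r[6] == "LMM")
# Studies with case-control traits (CC or Both) reporting ties or mixed results.
cc_tie_or_mixed = frac(ROWS, lambda r: r[3] in ("CC", "Both"), lambda r: r[6] != "LMM")
# Studies that did not measure power, out of all studies.
no_power = frac(ROWS, lambda r: True, lambda r: not r[4])
# Studies that used more than one number of PCs (a range in the PCs column).
multi_pcs = sum(1 for r in ROWS if "-" in r[5])
# Studies that used both simulated and real genotypes.
both_genotypes = sum(1 for r in ROWS if r[1] and r[2])
```

Studies with "Both" trait types are counted in the quantitative and the case-control tallies alike, which is the reading under which the totals of 12 and 9 are reproduced.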
We simulate quantitative traits from two models: fixed effect sizes (FES), which constructs coefficients inverse to allele frequency, matching real data (Park et al., 2011; Zeng et al., 2018; O’Connor et al., 2019) and corresponding to high pleiotropy and strong balancing selection (Simons et al., 2018) and strong negative selection (Zeng et al., 2018; O’Connor et al., 2019), which are appropriate assumptions for diseases; and random coefficients (RC), which draws coefficients independently of allele frequency and corresponds to neutral traits (Zeng et al., 2018; Simons et al., 2018). LMM without PCs consistently performs best in simulations without environment, and greatly outperforms PCA in the family simulation and in all real datasets. The tree simulations, which model subpopulations with a tree but exclude family structure, do not recapitulate the real data results, suggesting that family relatedness in real data is the reason for poor PCA performance. Lastly, removing up to 4th degree relatives in the real datasets recapitulates poor PCA performance, showing that the more numerous distant relatives explain the result, and suggesting that PCA is generally not an appropriate model for real data. We find that both LMM and PCA are able to model environment effects correlated with genetics, and LMM with PCs gains a small advantage in this setting only, but direct modeling of environment performs much better. Altogether, we find that LMMs without PCs are generally a preferable association model, and present novel simulation and evaluation approaches to measure the performance of these and other genetic association approaches.

Results

Overview of evaluations

We use three real genotype datasets and simulated genotypes from six population structure scenarios to cover various features of interest (Table 2). We introduce them in sets of three, as they appear in the rest of our results.
Population kinship matrices, which combine population and family relatedness, are estimated without bias using popkin (Ochoa and Storey, 2021; Figure 1). The first set of three simulated genotype datasets is based on an admixture model with 10 ancestries (Figure 1A; Ochoa and Storey, 2021; Gopalan et al., 2016; Cabreros and Storey, 2019). The ‘large’ version (1000 individuals) illustrates asymptotic performance, while the ‘small’ simulation (100 individuals) illustrates model overfitting. The ‘family’ simulation has admixed founders and draws a 20-generation random pedigree with assortative mating, resulting in a complex joint family and ancestry structure in the last generation (Figure 1B). The second set of three are the real human datasets representing global human diversity: Human Origins (Figure 1D), HGDP (Figure 1G), and 1000 Genomes (Figure 1J), which are enriched for small minor allele frequencies even after removing loci with MAF < 1% (Figure 1C). Last are subpopulation tree simulations (Figure 1F, I, L) fit to the kinship (Figure 1E, H and K) and MAF (Figure 1C) of each real human dataset, which by design do not have family structure.

Table 2. Features of simulated and real human genotype datasets.

| Dataset | Type | Loci (m) | Ind. (n) | Subpops.* (K) | Causal loci† (m1) | FST‡ |
|---|---|---|---|---|---|---|
| Admix. Large sim. | Admix. | 100 000 | 1000 | 10 | 100 | 0.1 |
| Admix. Small sim. | Admix. | 100 000 | 100 | 10 | 10 | 0.1 |
| Admix. Family sim. | Admix.+Pedig. | 100 000 | 1000 | 10 | 100 | 0.1 |
| Human Origins | Real | 190 394 | 2922 | 11–243 | 292 | 0.28 |
| HGDP | Real | 771 322 | 929 | 7–54 | 93 | 0.28 |
| 1000 Genomes | Real | 1 111 266 | 2504 | 5–26 | 250 | 0.22 |
| Human Origins sim. | Tree | 190 394 | 2922 | 243 | 292 | 0.23 |
| HGDP sim. | Tree | 771 322 | 929 | 54 | 93 | 0.25 |
| 1000 Genomes sim. | Tree | 1 111 266 | 2504 | 26 | 250 | 0.21 |

* For admixed family, ignores additional model dimension of 20 generation pedigree structure. For real datasets, lower range is continental subpopulations, upper range is number of fine-grained subpopulations.
† m1 = round(n h2 / 8) to balance power across datasets, shown for h2 = 0.8 only.
‡ Model parameter for simulations, estimated value on real datasets.
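The rule for the number of causal loci in the Table 2 footnote, m1 = round(n h2 / 8), can be checked against the table directly. A minimal sketch follows; the function name `num_causal_loci` is a hypothetical helper introduced here.

```python
def num_causal_loci(n: int, h2: float = 0.8) -> int:
    """Number of causal loci m1 = round(n * h2 / 8), as in the Table 2 footnote."""
    return round(n * h2 / 8)


# Sample sizes n from Table 2, paired with the m1 values listed there (h2 = 0.8).
table2 = {
    "Admix. Large sim.": (1000, 100),
    "Admix. Small sim.": (100, 10),
    "Human Origins": (2922, 292),
    "HGDP": (929, 93),
    "1000 Genomes": (2504, 250),
}
computed = {name: num_causal_loci(n) for name, (n, _) in table2.items()}
```

With h2 = 0.8 this amounts to one causal locus per ten individuals, which is how the rule keeps the per-locus signal comparable across datasets of different sizes.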
Figure 1. Population structures of simulated and real human genotype datasets. First two columns are population kinship matrices as heatmaps: individuals along x- and y-axis, kinship as color. Diagonal shows inbreeding values. (A) Admixture scenario for both Large and Small simulations. (B) Last generation of 20-generation admixed family, shows larger kinship values near diagonal corresponding to siblings, first cousins, etc. (C) Minor allele frequency (MAF) distributions. Real datasets and subpopulation tree simulations had MAF≥0.01 filter. (D) Human Origins is an array dataset of a large diversity of global populations. (G) Human Genome Diversity Panel (HGDP) is a WGS dataset from global native populations. (J) 1000 Genomes Project is a WGS dataset of global cosmopolitan populations. (F, I, L) Trees between subpopulations fit to real data. (E, H, K) Simulations from trees fit to the real data recapitulate subpopulation structure.

All traits in this work are simulated. We repeated all evaluations on two additive quantitative trait models, fixed effect sizes (FES) and random coefficients (RC), which differ in how causal coefficients are constructed. The FES model captures the rough inverse relationship between coefficient and minor allele frequency that arises under strong negative and balancing selection and has been observed in numerous diseases and other traits (Park et al., 2011; Zeng et al., 2018; Simons et al., 2018; O’Connor et al., 2019), so it is the focus of our results. The RC model draws coefficients independent of allele frequency, corresponding to neutral traits (Zeng et al., 2018; Simons et al., 2018), which results in a wider effect size distribution that reduces association power and effective polygenicity compared to FES.
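The contrast between the two trait models can be illustrated with a small sketch. The exact coefficient formulas are not given in this excerpt, so the forms below are assumptions: FES is taken as |β_i| = 1/sqrt(2 p_i (1 − p_i)) with a random sign, RC as β_i drawn from a standard normal, and the per-locus variance uses the Hardy-Weinberg genotype variance 2 p (1 − p). Function names are hypothetical.

```python
import math
import random


def fes_coefficients(freqs, rng):
    """FES sketch (assumed form): |beta_i| = 1 / sqrt(2 p_i (1 - p_i)), random sign."""
    return [rng.choice((-1.0, 1.0)) / math.sqrt(2 * p * (1 - p)) for p in freqs]


def rc_coefficients(freqs, rng):
    """RC sketch (assumed form): beta_i ~ Normal(0, 1), independent of p_i."""
    return [rng.gauss(0.0, 1.0) for _ in freqs]


def locus_variance(p, beta):
    """Trait variance contributed by one locus, assuming Hardy-Weinberg genotypes."""
    return 2 * p * (1 - p) * beta ** 2


rng = random.Random(1)
freqs = [0.01, 0.05, 0.2, 0.5]  # illustrative allele frequencies
fes = fes_coefficients(freqs, rng)
rc = rc_coefficients(freqs, rng)
fes_var = [locus_variance(p, b) for p, b in zip(freqs, fes)]
rc_var = [locus_variance(p, b) for p, b in zip(freqs, rc)]
```

Under these assumed forms every FES causal locus contributes the same variance regardless of frequency, while RC contributions vary from locus to locus, which is one way to see why RC has lower effective polygenicity than FES. A full simulation would additionally rescale the genetic component and add noise to reach the target heritability.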
We evaluate using two complementary measures: (1) SRMSDp (p-value signed root mean square deviation) measures p-value calibration (closer to zero is better), and (2) AUCPR (precision-recall area under the curve) measures causal locus classification performance (higher is better; Figure 2). SRMSDp is a more robust alternative to the common inflation factor λ and type I error control measures; there is a correspondence between λ and SRMSDp, with SRMSDp>0.01 giving λ>1.06 (Figure 2—figure supplement 1) and thus evidence of miscalibration close to the rule of thumb of λ>1.05 (Price et al., 2010). There is also a monotonic correspondence between SRMSDp and type I error rate (Figure 2—figure supplement 2). AUCPR has been used to evaluate association models (Rakitsch et al., 2013), and reflects calibrated statistical power (Figure 2—figure supplement 3) while being robust to miscalibrated models (Appendix 2).

Figure 2. Illustration of evaluation measures. Three archetypal models illustrate our complementary measures: M1 is ideal, M2 overfits slightly, M3 is naive. (A) QQ plot of p-values of “null” (non-causal) loci. M1 has desired uniform p-values, M2/M3 are miscalibrated. (B) SRMSDp (p-value Signed Root Mean Square Deviation) measures signed distance between observed and expected null p-values (closer to zero is better). (C) Precision and Recall (PR) measure causal locus classification performance (higher is better). (D) AUCPR (Area Under the PR Curve) reflects power (higher is better).

Both PCA and LMM are evaluated in each replicate dataset including a number of PCs r between 0 and 90 as fixed covariates. In terms of p-value calibration, for PCA the best number of PCs r (minimizing mean |SRMSDp| over replicates) is typically large across all datasets (Table 3), although much smaller r values often performed as well (shown in following sections).
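The two measures can be sketched in a few lines. The precise definitions are in the article's Methods, which are not part of this excerpt, so the details below are assumptions: expected null quantiles are taken as u_i = (i − 0.5)/m0, the sign is positive when the median null p-value falls below 0.5 (anti-conservative), and the area under the precision-recall curve is approximated by average precision. Function names are hypothetical, and tied p-values are not given special treatment.

```python
import math


def srmsd_p(null_pvalues):
    """Signed RMSD between sorted null p-values and uniform quantiles (assumed form)."""
    p = sorted(null_pvalues)
    m0 = len(p)
    # Expected quantiles of a Uniform(0, 1) sample: (i - 0.5) / m0 for i = 1..m0.
    u = [(i + 0.5) / m0 for i in range(m0)]
    rmsd = math.sqrt(sum((ui - pi) ** 2 for ui, pi in zip(u, p)) / m0)
    p_median = (p[(m0 - 1) // 2] + p[m0 // 2]) / 2
    # Positive if observed p-values are too small, negative if too large.
    sign = (p_median < 0.5) - (p_median > 0.5)
    return sign * rmsd


def aucpr(pvalues, is_causal):
    """Average precision: mean precision at the rank of each causal locus."""
    n_causal = sum(1 for c in is_causal if c)
    if n_causal == 0:
        raise ValueError("at least one causal locus is required")
    # Rank loci from most to least significant.
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    true_pos = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if is_causal[i]:
            true_pos += 1
            total += true_pos / rank
    return total / n_causal
```

For example, `srmsd_p` returns zero for p-values that sit exactly on the uniform quantiles, and `aucpr` returns 1 when every causal locus has a smaller p-value than every null locus. Note that `srmsd_p` takes null loci only, while `aucpr` takes all loci with their causal labels, matching panels A-B and C-D of Figure 2 respectively.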
Most cases have a mean |SRMSDp|<0.01, whose p-values are effectively calibrated. However, PCA is often miscalibrated on the family simulation and real datasets (Table 3). In contrast, for LMM, r=0 (no PCs) is always best, and is always calibrated. Comparing LMM with r=0 to PCA with its best r, LMM always has significantly smaller |SRMSDp| than PCA or is statistically tied. For AUCPR and PCA, the best r is always smaller than the best r for |SRMSDp|, so there is often a tradeoff between calibrated p-values versus classification performance. For LMM, there is no tradeoff, as r=0 often has the best mean AUCPR, and otherwise is not significantly different from the best r. Lastly, LMM with r=0 always has AUCPR that is significantly greater than, or statistically tied with, that of PCA with its best r.

Table 3. Overview of PCA and LMM evaluations for high heritability simulations. The three LMM columns compare LMM r=0 vs LMM with its best r; the three PCA columns and the best model compare PCA with its best r vs LMM r=0.

| Dataset | Metric | Trait* | LMM Cal.† | LMM best r‡ | LMM P-value§ | PCA best r‡ | PCA Cal.† | PCA P-value§ | Best model¶ |
|---|---|---|---|---|---|---|---|---|---|
| Admix. Large sim. | \|SRMSDp\| | FES | True | 0 | 1 | 12 | True | 0.036 | Tie |
| Admix. Small sim. | \|SRMSDp\| | FES | True | 0 | 1 | 4 | True | 0.055 | Tie |
| Admix. Family sim. | \|SRMSDp\| | FES | True | 0 | 1 | 90 | False | 3.9e-10* | LMM |
| Human Origins | \|SRMSDp\| | FES | True | 0 | 1 | 89 | False | 3.9e-10* | LMM |
| HGDP | \|SRMSDp\| | FES | True | 0 | 1 | 87 | True | 4.4e-10* | LMM |
| 1000 Genomes | \|SRMSDp\| | FES | True | 0 | 1 | 90 | False | 3.9e-10* | LMM |
| Human Origins sim. | \|SRMSDp\| | FES | True | 0 | 1 | 88 | True | 0.017 | Tie |
| HGDP sim. | \|SRMSDp\| | FES | True | 0 | 1 | 47 | True | 0.046 | Tie |
| 1000 Genomes sim. | \|SRMSDp\| | FES | True | 0 | 1 | 78 | True | 9.6e-10* | LMM |
| Admix. Large sim. | \|SRMSDp\| | RC | True | 0 | 1 | 26 | True | 0.11 | Tie |
| Admix. Small sim. | \|SRMSDp\| | RC | True | 0 | 1 | 4 | True | 0.00097 | Tie |
| Admix. Family sim. | \|SRMSDp\| | RC | True | 0 | 1 | 90 | False | 3.9e-10* | LMM |
| Human Origins | \|SRMSDp\| | RC | True | 0 | 1 | 90 | True | 0.00065 | Tie |
| HGDP | \|SRMSDp\| | RC | True | 0 | 1 | 37 | True | 1.5e-05* | LMM |
| 1000 Genomes | \|SRMSDp\| | RC | True | 0 | 1 | 76 | True | 3.9e-10* | LMM |
| Human Origins sim. | \|SRMSDp\| | RC | True | 0 | 1 | 85 | True | 0.14 | Tie |
| HGDP sim. | \|SRMSDp\| | RC | True | 0 | 1 | 44 | True | 8.8e-07* | LMM |
| 1000 Genomes sim. | \|SRMSDp\| | RC | True | 0 | 1 | 90 | True | 3.9e-10* | LMM |
| Admix. Large sim. | AUCPR | FES | | 0 | 1 | 3 | | 5.9e-06* | LMM |
| Admix. Small sim. | AUCPR | FES | | 0 | 1 | 2 | | 0.025 | Tie |
| Admix. Family sim. | AUCPR | FES | | 1 | 0.35 | 22 | | 3.9e-10* | LMM |
| Human Origins | AUCPR | FES | | 0 | 1 | 34 | | 3.9e-10* | LMM |
| HGDP | AUCPR | FES | | 1 | 0.33 | 16 | | 4.4e-10* | LMM |
| 1000 Genomes | AUCPR | FES | | 1 | 0.11 | 8 | | 3.9e-10* | LMM |
| Human Origins sim. | AUCPR | FES | | 0 | 1 | 36 | | 3.9e-10* | LMM |
| HGDP sim. | AUCPR | FES | | 0 | 1 | 17 | | 1.7e-05* | LMM |
| 1000 Genomes sim. | AUCPR | FES | | 0 | 1 | 10 | | 5e-10* | LMM |
| Admix. Large sim. | AUCPR | RC | | 0 | 1 | 3 | | 1.4e-05* | LMM |
| Admix. Small sim. | AUCPR | RC | | 0 | 1 | 1 | | 0.095 | Tie |
| Admix. Family sim. | AUCPR | RC | | 0 | 1 | 34 | | 3.9e-10* | LMM |
| Human Origins | AUCPR | RC | | 3 | 0.4 | 36 | | 9.6e-10* | LMM |
| HGDP | AUCPR | RC | | 4 | 0.21 | 16 | | 0.013 | Tie |
| 1000 Genomes | AUCPR | RC | | 5 | 0.004 | 9 | | 0.00043 | Tie |
| Human Origins sim. | AUCPR | RC | | 0 | 1 | 37 | | 4.1e-10* | LMM |
| HGDP sim. | AUCPR | RC | | 3 | 0.087 | 17 | | 0.0014 | Tie |
| 1000 Genomes sim. | AUCPR | RC | | 3 | 0.37 | 10 | | 8.5e-10* | LMM |

* FES: Fixed Effect Sizes, RC: Random Coefficients.
† Calibrated: whether mean |SRMSDp|<0.01 over 50 replicates.
‡ Value of r (number of PCs) with minimum mean |SRMSDp| or maximum mean AUCPR.
§ Wilcoxon paired 1-tailed test of distributions (|SRMSDp| or AUCPR) between models in header. Asterisk marks significant value using Bonferroni threshold (p<α/ntests, with α=0.01 and ntests=72 the number of tests in this table).
¶ Tie if no significant difference using Bonferroni threshold.

Evaluations in admixture simulations

Now we look more closely at results per dataset. The complete SRMSDp and AUCPR distributions for the admixture simulations and FES traits are in Figure 3. RC traits gave qualitatively similar results (Figure 3—figure supplement 1).

Figure 3. Evaluations in admixture simulations with FES traits, high heritability. PCA and LMM models have varying number of PCs (r∈{0,…,90} on x-axis), with the distributions (y-axis) of SRMSDp (top subpanel) and AUCPR (bottom subpanel) for 50 replicates. Best performance is zero SRMSDp and large AUCPR. Zero and maximum median AUCPR values are marked with horizontal gray dashed lines, and |SRMSDp|<0.01 is marked with a light gray area. LMM performs best with r=0, PCA with various r. (A) Large simulation (n=1,000 individuals). (B) Small simulation (n=100) shows overfitting for large r.
(C) Family simulation (n=1,000) has admixed founders and large numbers of close relatives from a realistic random 20-generation pedigree. PCA performs poorly compared to LMM: SRMSDp>0 for all r and large AUCPR gap. In the large admixture simulation, the SRMSDp of PCA is largest when r=0 (no PCs) and decreases rapidly to near zero at r=3, where it stays for up to r=90 (Figure 3A). Thus, PCA has calibrated p-values for r≥3, smaller than the theoretical optimum for this simulation of r=K-1=9. In contrast, the SRMSDp for LMM starts near zero for r=0, but becomes negative as r increases (p-values are conservative). The AUCPR distribution of PCA is similarly worst at r=0, increases rapidly and peaks at r=3, then decreases slowly for r>3, while the AUCPR distribution for LMM starts near its maximum at r=0 and decreases with r. Although the AUCPR distributions for LMM and PCA overlap considerably at each r, LMM with r=0 has significantly greater AUCPR values than PCA with r=3 (Table 3). However, qualitatively PCA performs nearly as well as LMM in this simulation. The observed robustness to large r led us to consider smaller sample sizes. A model with large numbers of parameters r should overfit more as r approaches the sample size n. Rather than increase r beyond 90, we reduce individuals to n=100, which is small for typical association studies but may occur in studies of rare diseases, pilot studies, or other constraints. To compensate for the loss of power due to reducing n, we also reduce the number of causal loci (see Trait Simulation), which increases per-locus effect sizes. We found a large decrease in performance for both models as r increases, and best performance for r=1 for PCA and r=0 for LMM (Figure 3B). Remarkably, LMM attains much larger negative SRMSDp values than in our other evaluations. LMM with r=0 is significantly better than PCA (r=1 to 4) in both measures (Table 3), but qualitatively the difference is negligible. 
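The "LMM" versus "Tie" calls in Table 3 follow from the Bonferroni rule in its footnote (α=0.01 over 72 tests). The sketch below applies that rule to a few p-values copied from the table; the function name `best_model` is a hypothetical helper introduced here.

```python
ALPHA = 0.01
N_TESTS = 72
THRESHOLD = ALPHA / N_TESTS  # roughly 1.39e-4


def best_model(p_value: float) -> str:
    """Return 'LMM' if PCA vs LMM r=0 is significant after Bonferroni, else 'Tie'."""
    return "LMM" if p_value < THRESHOLD else "Tie"


# P-values from the "PCA vs LMM r=0" column of Table 3.
calls = {
    "Admix. Large sim., |SRMSDp|, FES": best_model(0.036),
    "Admix. Large sim., AUCPR, FES": best_model(5.9e-06),
    "Admix. Small sim., |SRMSDp|, RC": best_model(0.00097),
    "HGDP, |SRMSDp|, RC": best_model(1.5e-05),
}
```

A p-value such as 0.00097 would pass an uncorrected 0.01 cutoff but not the corrected one, which is why that row is reported as a tie.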
The family simulation adds a 20-generation random family to our large admixture simulation. Only the last generation, which contains numerous siblings, first cousins, etc., is studied for association, with the initial admixture structure preserved by geographically biased mating. Our evaluation reveals a sizable gap in both measures between LMM and PCA across all r (Figure 3C). LMM again performs best with r=0 and achieves mean |SRMSDp|<0.01. However, PCA does not achieve mean |SRMSDp|<0.01 at any r, and its best mean AUCPR is considerably worse than that of LMM. Thus, LMM is conclusively superior to PCA, and the only calibrated model, when there is family structure.

Evaluations in real human genotype datasets

Next, we repeat our evaluations with real human genotype data, which differs from our simulations in allele frequency distributions and more complex population structures with greater FST, numerous correlated subpopulations, and potential cryptic family relatedness. Human Origins has the greatest number and diversity of subpopulations. The SRMSDp and AUCPR distributions in this dataset and FES traits (Figure 4A) most resemble those from the family simulation (Figure 3C). In particular, while LMM with r=0 performed optimally (both measures) and satisfies mean |SRMSDp|<0.01, PCA maintained SRMSDp>0.01 for all r and its AUCPR values were all considerably smaller than the best AUCPR of LMM.

Figure 4. Evaluations in real human genotype datasets with FES traits, high heritability. Same setup as Figure 3; see that figure for details. These datasets strongly favor LMM with no PCs over PCA, with distributions that most resemble the family simulation. (A) Human Origins. (B) Human Genome Diversity Panel (HGDP). (C) 1000 Genomes Project.

HGDP has the fewest individuals among real datasets, but compared to Human Origins contains more loci and low-frequency variants. Performance (Figure 4B) again most resembled the family simulations.
In particular, LMM with r=0 achieves mean |SRMSDp|<0.01 (p-values are calibrated), while PCA does not, and there is a sizable AUCPR gap between LMM and PCA. Maximum AUCPR values were lowest in HGDP compared to the two other real datasets. 1000 Genomes has the fewest subpopulations but largest number of individuals per subpopulation. Thus, although this dataset has the simplest subpopulation structure among the real datasets, we find SRMSDp and AUCPR distributions (Figure 4C) that again most resemble our earlier family simulation, with mean |SRMSDp|<0.01 for LMM only and large AUCPR gaps between LMM and PCA. Our results are qualitatively different for RC traits, which had smaller AUCPR gaps between LMM and PCA (Figure 4—figure supplement 1). Maximum AUCPR values were smaller in RC compared to FES in Human Origins and 1000 Genomes, suggesting lower power for RC traits across association models. Nevertheless, LMM with r=0 was significantly better than PCA for all measures in the real datasets and RC traits (Table 3).

Evaluations in subpopulation tree simulations fit to human data

To better understand which features of the real datasets lead to the large differences in performance between LMM and PCA, we carried out subpopulation tree simulations. Human subpopulations are related roughly by trees, which induce the strongest correlations, so we fit trees to each real dataset and tested if data simulated from these complex tree structures could recapitulate our previous results (Figure 1). These tree simulations also feature non-uniform ancestral allele frequency distributions, which recapitulated some of the skew for smaller minor allele frequencies of the real datasets (Figure 1C). The SRMSDp and AUCPR distribution
