Abstract

Similarity between two individuals in the combination of genetic markers along their chromosomes indicates shared ancestry and can be used to identify historical connections between different population groups due to admixture. We use a genome-wide, haplotype-based analysis to characterise the structure of genetic diversity and gene-flow in a collection of 48 sub-Saharan African groups. We show that coastal populations experienced an influx of Eurasian haplotypes over the last 7000 years, and that Eastern and Southern Niger-Congo speaking groups share ancestry with Central West Africans as a result of recent population expansions. In fact, most sub-Saharan populations share ancestry with groups from outside of their current geographic region as a result of gene-flow within the last 4000 years. Our in-depth analysis provides insight into haplotype sharing across different ethno-linguistic groups and the recent movement of alleles into new environments, both of which are relevant to studies of genetic epidemiology.

https://doi.org/10.7554/eLife.15266.001

eLife digest

Our genomes contain a record of historical events. This is because when groups of people are separated for generations, the DNA sequence in the two groups’ genomes will change in different ways. Looking at the differences in the genomes of people from the same population can help researchers to understand and reconstruct the historical interactions that brought their ancestors together. The mixing of two populations that were previously separate is known as admixture. Africa as a continent has few written records of its history. This means that it is somewhat unknown which important movements of people in the past generated the populations found in modern-day Africa. Busby et al. have now attempted to use DNA to look into this and reconstruct the last 4000 years of genetic history in African populations. As has been shown in other regions of the world, the new analysis showed that all African populations are the result of historical admixture events. However, Busby et al. could characterize these events in an unprecedented level of detail. For example, multiple ethnic groups from The Gambia and Mali all show signs of sharing the same set of ancestors from West Africa, Europe and Asia who mixed around 2000 years ago. Evidence of a migration of people from Central West Africa, known as the Bantu expansion, could also be detected, and was shown to carry genes to the south and east. An important next step will be to look at the consequences of the observed gene-flow, and ask if it has contributed to spreading beneficial, or detrimental, mutations around Africa.

https://doi.org/10.7554/eLife.15266.002

Introduction

Advances in DNA analysis technology and the drive to understand the genetic basis of human phenotypes have led to a rapid growth in the amount of genomic data that is available for analysis. Whilst tens of thousands of genetic variants have been associated with different diseases in populations of European descent (Welter et al., 2014), less progress has been made in studies of important diseases in Africa (Need and Goldstein, 2009). Several consortia are beginning to focus on understanding the genetic basis of infectious and non-communicable disease specifically in Africa (Malaria Genomic Epidemiology Network, 2008; 2015; H3Africa Consortium, 2014; Gurdasani et al., 2014), and a number of recent studies have described patterns of genetic variation across the continent (Campbell and Tishkoff, 2008; Tishkoff et al., 2009; Gurdasani et al., 2014).
Analyses of the structure of genetic variation are important in the design, analysis, and interpretation of genetic epidemiology studies – which aim to uncover novel relationships between genes, the environment, and disease (Malaria Genomic Epidemiology Network, 2015) – and provide an opportunity to relate patterns of association to historical connections between different human populations. Admixture occurs when genetically differentiated ancestral groups come together and mix, a process which is increasingly regarded as a common feature of human populations across the globe (Patterson et al., 2012; Hellenthal et al., 2014; Busby et al., 2015). Genome-wide analyses of African populations are refining previous models of the continent’s history and its impact on genetic diversity. One insight is the identification of clear, but complex, evidence for the movement of Eurasian ancestry back into the continent as a result of admixture over a variety of timescales (Pagani et al., 2012; Pickrell et al., 2014; Gurdasani et al., 2014; Hodgson et al., 2014a; Llorente et al., 2015). On a broad sample of 18 ethnic groups from eight countries, the African Genome Variation Project (AGVP) (Gurdasani et al., 2014) recreated a previous analysis to identify recent Eurasian admixture, within the last 1.5 thousand years (ky), in the Fulani of West Africa (Tishkoff et al., 2009; Henn et al., 2012) and several East African groups from Kenya; older Eurasian ancestry (2–5 ky) in Ethiopian groups, consistent with previous studies of similar populations (Pagani et al., 2012; Pickrell et al., 2014); and a novel signal of ancient (>7.5 ky) Eurasian admixture in the Yoruba of Central West Africa (Gurdasani et al., 2014). 
Comparisons of contemporary sub-Saharan African populations with the first ancient genome from within Africa, a 4.5 ky Ethiopian individual (Llorente et al., 2015), provide additional support for limited migration of Eurasian ancestry back into East Africa within the last 3000 years. Within this timescale, the major demographic change within Africa was the transition from hunting and gathering to pastoralist and agricultural lifestyles (Diamond and Bellwood, 2003; Smith, 2005; Barham and Mitchell, 2008; Li et al., 2014). This shift was long and complex and occurred at different speeds, instigating contrasting interactions between the agriculturalist pioneers and the inhabitant people (Mitchell, 2002; Marks et al., 2014). The change was initialised by the spread of pastoralism (i.e. the raising and herding of livestock) across Africa and the subsequent movement east and south from Central West Africa of agricultural technology together with the branch of Niger-Congo languages known as Bantu (Mitchell, 2002; Barham and Mitchell, 2008). The extent to which this cultural expansion was accompanied by people is an active research question, but an increasing number of molecular studies indicate that the expansion of languages was accompanied by the diffusion of people (Beleza et al., 2005; Berniell-Lee et al., 2009; Tishkoff et al., 2009; Pakendorf et al., 2011; de Filippo et al., 2012; Ansari Pour et al., 2013; Li et al., 2014; González-Santos et al., 2015). The AGVP also found evidence of widespread hunter-gatherer ancestry in African populations, including ancient (9 ky) Khoesan ancestry in the Igbo from Nigeria, and more recent hunter-gatherer ancestry in eastern (2.5–4.5 ky) and southern (0.9–4 ky) African populations (Gurdasani et al., 2014). 
The identification of hunter-gatherer ancestry in non-hunter-gatherer populations together with the timing of these latter events is consistent with the known expansion of Bantu languages across Africa within the last 3 ky (Mitchell, 2002; Diamond and Bellwood, 2003; Smith, 2005; Barham and Mitchell, 2008; Marks et al., 2014; Li et al., 2014). These studies have described the novel and important influence of both Eurasian and hunter-gatherer ancestry on the population genetic history of sub-Saharan Africa and provide an important description of the movement of alleles and haplotypes into and within the continent, but questions remain about the extent and timing of key events, and their impact on contemporary populations. Here we analyse genome-wide data from 12 Eurasian and 46 sub-Saharan African groups. Half (23) of the African groups represent subsets of samples collected from nine countries as part of the MalariaGEN consortium. Details on the recruitment of samples in relation to studying malaria genetics are published elsewhere (Malaria Genomic Epidemiology Network, 2014; 2015). The remaining 23 groups are from publicly available datasets from a further eight sub-Saharan African countries (Pagani et al., 2012; Schlebusch et al., 2012; Petersen et al., 2013) and the 1000 Genomes Project (1KGP), with Eurasian groups from the latter included to help understand the genetic contribution from outside of the continent (Figure 1—figure supplement 1). With the exception of Austronesian in Madagascar, African languages can be broadly classified into four major macro-families: Afroasiatic, Nilo-Saharan, Niger-Congo, and Khoesan (Blench, 2006). Although we have representative groups from each (Supplementary file 1), and our sample represents a significant proportion of the sub-Saharan population in terms of number, it does not equate to a complete picture of African ethnic diversity.
We created an integrated dataset of genotypes at 328,000 high-quality SNPs and use established approaches for comparing population allele frequencies across groups to provide a baseline view of historical gene-flow. We then apply statistical approaches to phasing genotypes to obtain haplotypes for each individual, and use previously published methods to represent the haplotypes that an individual carries as a mosaic of other haplotypes in the sample (so-called chromosome painting [Li and Stephens, 2003]). We present a detailed picture of haplotype sharing across sub-Saharan Africa using a model-based clustering approach that groups individuals using haplotype information alone. The inferred groups reflect broad-scale geographic patterns. At finer scales, our analysis reveals smaller groups, and often differentiates closely related populations consistent with self-reported ancestry (Tishkoff et al., 2009; Bryc et al., 2010; Hodgson et al., 2014a). We describe these patterns by measuring gene-flow between populations and relate them to potential historical movements of people into and within sub-Saharan Africa. Understanding the extent to which individuals share haplotypes (which we call coancestry), rather than independent markers, can provide a rich description of ancestral relationships and population history (Lawson et al., 2012; Leslie et al., 2015). For each group we use the latest analytical tools to characterise the populations as mixtures of haplotypes and provide estimates for the date of admixture events (Lawson et al., 2012; Hellenthal et al., 2014; Leslie et al., 2015; Montinaro et al., 2015). As well as providing a quantitative measure of the coancestry between groups, we identify the dominant events which have shaped current genetic diversity in sub-Saharan Africa. We close by discussing the relevance of these observations to studying genotype-phenotype associations in Africa. 
Results

Broad-scale population structure reflects geography and language

Throughout this article we use shorthand current-day geographical and ethno-linguistic labels to describe ancestry. For example we write “Eurasian ancestry in East African Niger-Congo speakers”, where the more precise definition would be “ancestry originating from groups currently living in Eurasia in groups currently living in East Africa that speak Niger-Congo languages” (Pickrell et al., 2014). We also stress that the use of Khoesan in the current setting refers to groups with shared linguistic characteristics, which does not necessarily imply shared close genealogical relationships (Güldemann and Fehn, 2014). Our combined dataset included 3283 individuals from 46 different sub-Saharan African ethnic groups and 12 non-African populations (Figure 1A and Figure 1—figure supplement 1). An initial fineSTRUCTURE analysis (outlined below and in Figure 1—figure supplement 2 and Figure 1—figure supplement 3) demonstrated sub-structure in two of the African ethnic groups, the Fula and Mandinka, so we split both of these populations into two groups, giving a final set of 48 African groups for all analyses.

Figure 1. Sub-Saharan African genetic variation is shaped by ethno-linguistic and geographical similarity. (A) The origin of the 46 African ethnic groups used in the analysis; ethnic groups from the same country are given the same colour, but different shapes; the legend describes the identity of each point. Figure 1—figure supplement 1 and Figure 1—source data 1 provide further detail on the provenance of these samples. (B) PCA shows that the first major axis of variation in Africa (PC1, y-axis) splits southern groups from the rest of Africa; each symbol represents an individual; PC2 (x-axis) reflects ethno-linguistic differences, with Niger-Congo speakers split from Afroasiatic and Nilo-Saharan speakers. Tick marks here and in (C) show the scale. (C) The third principal component (PC3, x-axis) represents geographical separation of Niger-Congo speakers, forming a cline from west to east Africans. (D) Results of the fineSTRUCTURE clustering analysis using copying vectors generated from chromosome painting; each row of the heatmap is a recipient copying vector showing the number of chunks shared between the recipient and every individual as a donor (columns); the tree clusters individuals with similar copying vectors together, such that block-like patterns are observed on the heatmap; darker colours on the heatmap represent more haplotype sharing (see text for details); individual tips of the tree are coloured by country of origin, and the seven ancestry regions are identified and labelled to the left of the tree; labels in parentheses describe the major linguistic type of the ethnic groups within: AA = Afroasiatic, KS = Khoesan, NC = Niger-Congo, NS = Nilo-Saharan.
https://doi.org/10.7554/eLife.15266.003

Figure 1—source data 1. Overview of sampled populations describing the continent, region, numbers of individuals used, and the source of any previously published datasets.
https://doi.org/10.7554/eLife.15266.004

As an initial description of the genetic structure of the samples we applied principal component analysis to the genotype data (Patterson et al., 2006). As in other regions of the world (Novembre et al., 2008; Behar et al., 2010), the leading principal components show that genetic relationships are broadly defined by geographical and ethno-linguistic similarity (Figure 1B,C). The first two principal components (PCs) reflect ethno-linguistic divides: PC1 splits southern Khoesan speaking populations from the rest of Africa, and PC2 splits the East African Afroasiatic and Nilo-Saharan speakers from sub-Saharan African Niger-Congo speakers.
The third axis of variation defines east versus west Africa, suggesting that in general, population structure in Africa largely mirrors linguistic and geographic similarity (Tishkoff et al., 2009). To access the information from the combination of markers along chromosomes we phased the genotype data into haplotypes, and applied a previously published implementation of chromosome painting (CHROMOPAINTER [Lawson et al., 2012]), to estimate the amount of an individual’s genome that is shared with each other individual in the data. More specifically, we paint each recipient individual’s genome as a mosaic of haplotype segments (chunks) copied from each other donor individual, and summarise these as copying vectors. We used the clustering algorithm implemented in fineSTRUCTURE (Lawson et al., 2012) to group individuals purely on the similarity of these copying vectors (Figure 1 and Figure 1—figure supplement 3). The pairwise coancestry between individuals can be visualised as a heatmap with each row being the copying vector for each sample (Figure 1D), and these are clustered hierarchically to form a tree which describes the inferred relationship between different groups (Figure 1—figure supplement 3). The fineSTRUCTURE analysis identified 154 clusters of individuals, grouped on the basis of copying vector similarity (Figure 1—figure supplement 3). Some ethnic groups, such as the Yoruba, Mossi, Jola and Ju/’hoansi form clusters containing only individuals from their own ethnic group. In other populations, most notably from The Gambia and Kenya, individuals from several different ethnic groups cluster together. These are the two countries where the most ethnic groups were sampled, seven and four respectively, and differential sampling could partly explain this observation. Consistent with PCA, the fineSTRUCTURE analysis indicates that African populations tend to share more DNA with geographically proximate populations (dark colours on the diagonal; Figure 1D). 
Block-like structures on the diagonal indicate higher levels of haplotype sharing, as measured by the number of chunks copied, within groups. These patterns are strongest in a subset of the Khoesan speaking individuals (e.g. the Ju/’hoansi), several groups from East Africa (Sudanese, Ari, and Somali groups), and the Fulani and Jola from The Gambia. Using the results of the PCA and fineSTRUCTURE analyses together with ethno-linguistic classifications and geography, we defined seven groups of populations within Africa (Supplementary file 1), which we refer to as ancestry regions (shown on the left of Figure 1D) when describing gene-flow across Africa. From this perspective, the heatmap also shows evidence for coancestry across the continent (more chunks copied away from the diagonal), which is indicative of historical connections between modern-day groups. For example, east Africans from Kenya, Malawi and Tanzania tend to share more DNA with west Africans (lower right) than vice versa (upper left), which suggests that more haplotypes may have spread from west to east Africa. These patterns of coancestry provide evidence of widespread sharing of haplotypes within and between ancestry regions.

Haplotypes reveal subtle population structure

To quantify population structure, we used two metrics to measure the difference between each of the 48 African and 12 Eurasian groups. First, we used the classical measure FST (Hudson et al., 1992; Bhatia et al., 2013), which measures the differentiation in SNP allele frequencies between two groups. The second metric uses the difference in copying vectors between two groups to estimate the total variation distance (TVD) (Leslie et al., 2015) at the haplotypic level, which provides an alternative measure of differentiation based on combinations of alleles at SNPs along chromosomes. Figure 2A shows these two metrics side by side in the upper and lower diagonal.
When compared to the level of differentiation between Eurasian and African populations, FST measured at our integrated set of SNPs is relatively low between many groups from West, Central, and East Africa (yellows on the upper right triangle). In contrast, TVD between the same populations highlights haplotypic differences within Africa which are as strong as between Europe and Asia (pink and purples in lower left triangle). Whilst pairwise TVD tends to increase with pairwise FST, the relationship is neither perfect (Pearson’s correlation R² = 0.79) nor linear (Figure 2B). For example, the Chonyi from Kenya have relatively low FST but high TVD with West African groups, like the Jola (Chonyi-Jola FST = 0.019; Chonyi-Jola TVD = 0.803), showing that, whilst allele frequency differences between the two populations are relatively low, when we compare the populations’ copying vectors, the haplotypic differences are some of the strongest between sub-Saharan groups.

Figure 2. Haplotypes capture more population structure than independent loci. (A) For each population pair, we estimated pairwise FST (upper right triangle) using 328,000 independent SNPs, and TVD (lower left triangle) using population averaged copying vectors from CHROMOPAINTER. TVD measures the difference between two copying vectors. (B) Comparison of pairwise FST and TVD shows that they are not linearly related: some population pairs have low FST and high TVD. (Source data is detailed in Figure 2—source data 2 to Figure 2—source data 1).
https://doi.org/10.7554/eLife.15266.008

Figure 2—source data 1. Pairwise TVD for Eurasian populations. TVD has been multiplied by 1000.
https://doi.org/10.7554/eLife.15266.009

Figure 2—source data 2. Pairwise FST for Eurasian populations. We used smartpca to compute FST for each pair of populations, upper right diagonal, together with standard errors computed using a block jackknife. FST has been multiplied by 1000.
https://doi.org/10.7554/eLife.15266.010

Figure 2—source data 3. Pairwise FST for African populations. We used smartpca to compute FST for each pair of populations, upper right diagonal, together with standard errors computed using a block jackknife. FST has been multiplied by 1000.
https://doi.org/10.7554/eLife.15266.011

Figure 2—source data 4. Pairwise TVD for African populations. TVD has been multiplied by 1000.
https://doi.org/10.7554/eLife.15266.012

In Figure 2—figure supplement 1 we show a comparison of PCA, based on genotype data, and fineSTRUCTURE, which uses haplotypes, from a subset of individuals from the Central West African Niger-Congo ancestry region (from Nigeria, Ghana, and Burkina Faso). Whilst we observe some, limited, population structure with PCA, when we look at the copying vectors, we can see the subtle differences in copying that cause fineSTRUCTURE to separate the five ethnic groups into clusters containing only individuals from their own ethnic group. The exceptions are the Namkam and Kasem, who are very genetically similar (pairwise FST of < 0.001) and are merged into a single group. So, consistent with results in European populations (Leslie et al., 2015; Busby et al., 2015), chromosome painting analyses of African groups can reveal subtle population structure that is hard to detect using approaches based on genotypes alone (for example PCA and FST). Taken together, these observations motivate using haplotype-based approaches to characterise population relationships, in addition to those which consider allele frequencies on their own.
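The total variation distance used in this section can be illustrated with a minimal sketch. It assumes the definition given by Leslie et al. (2015), where TVD is half the sum of absolute differences between two copying vectors that each sum to one. The chunk counts, donor groups, and function names below are hypothetical and are not taken from the study data or from the CHROMOPAINTER software.

```python
from typing import List, Sequence


def normalise(chunk_counts: Sequence[float]) -> List[float]:
    """Turn raw chunk counts into a copying vector that sums to 1."""
    total = sum(chunk_counts)
    if total <= 0:
        raise ValueError("chunk counts must have a positive sum")
    return [count / total for count in chunk_counts]


def total_variation_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Half the sum of absolute differences between two copying vectors.

    Assumed definition (after Leslie et al., 2015); ranges from 0 to 1.
    """
    if len(a) != len(b):
        raise ValueError("copying vectors must cover the same donors")
    return 0.5 * sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical chunk counts copied from four donor groups (illustration only).
pop_a = normalise([40, 30, 20, 10])
pop_b = normalise([10, 20, 30, 40])
tvd_ab = total_variation_distance(pop_a, pop_b)
tvd_aa = total_variation_distance(pop_a, pop_a)
```

Because the vectors are proportions rather than allele frequencies at single SNPs, two populations can have similar allele frequencies (low FST) and still copy from quite different donors (high TVD).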
Allele frequency differences show widespread evidence for admixture

As argued above, a full analysis of admixture best leverages haplotype structure, and we return to this below. To gain an initial understanding of admixture, we applied previously published approaches which analyse the correlations in allele frequencies within and between populations (Pickrell et al., 2014; Gurdasani et al., 2014). The first approach, the three-population test (f3 statistic [Reich et al., 2009]), estimates the proportion of shared genetic drift between a target population and two potential source populations to identify significant departures from the null model of no admixture. Negative values are indicative of canonical admixture events where the allele frequencies in the target population are intermediate between the two source populations. Consistent with recent research (Pickrell et al., 2014; Pickrell and Reich, 2014; Gurdasani et al., 2014; Llorente et al., 2015), the majority (83%, 40/48), but not all, of the African groups surveyed showed evidence of admixture (f3<-5) (Supplementary file 2). We do not infer admixture using this statistic in the Jola, Mossi, Kasem, Namkam, Yoruba, Sudanese, Gumuz, and Ju/’hoansi. In most other groups the most significant f3 statistic includes either the Ju/’hoansi or a 1KGP European source (GBR, CEU, FIN, or TSI). Niger-Congo speaking groups from Central West and Southern Africa tend to show most significant statistics involving the Ju/’hoansi, whereas West and East African and Southern Khoesan speaking groups tend to show most significant statistics involving European sources, consistent with a recent analysis on a similar (albeit smaller) set of African populations (Gurdasani et al., 2014). The second approach, ALDER (Loh et al., 2013; Pickrell et al., 2014) (Supplementary file 2), exploits the fact that correlations between allele frequencies along the genome decay over time as a result of recombination.
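To make the three-population test described above concrete, the following is a minimal sketch of a naive f3 estimate, computed as the average over SNPs of (c - a)(c - b) for target frequency c and source frequencies a and b. It is an assumption-laden illustration: it omits the finite-sample bias correction and the block-jackknife standard errors that published implementations use to assess significance, and the allele frequencies and function name are hypothetical.

```python
from typing import Sequence


def naive_f3(target: Sequence[float],
             source_a: Sequence[float],
             source_b: Sequence[float]) -> float:
    """Naive f3(target; source_a, source_b) from per-SNP allele frequencies.

    Averages (c - a) * (c - b) across SNPs. No sample-size correction and
    no standard error are computed here (simplifying assumption).
    """
    if not (len(target) == len(source_a) == len(source_b)):
        raise ValueError("all populations must have the same number of SNPs")
    if len(target) == 0:
        raise ValueError("at least one SNP is required")
    products = [(c - a) * (c - b)
                for c, a, b in zip(target, source_a, source_b)]
    return sum(products) / len(products)


# Hypothetical allele frequencies at three SNPs (illustration only).
freq_a = [0.1, 0.9, 0.2]
freq_b = [0.9, 0.1, 0.8]
# A target formed as an equal mixture of the two sources.
freq_mixed = [0.5 * a + 0.5 * b for a, b in zip(freq_a, freq_b)]

f3_mixed = naive_f3(freq_mixed, freq_a, freq_b)   # intermediate target
f3_unmixed = naive_f3(freq_a, freq_a, freq_b)     # target identical to a source
```

In this toy case the mixed target has frequencies between the two sources at every SNP, so each product is negative, which is the pattern the text describes as indicative of admixture.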
Linkage disequilibrium (LD) can be generated by admixture events, and leaves detectable signals in the genome that can be used to infer historical processes (Loh et al., 2013). Following Pickrell et al. (2014) and the AGVP (Gurdasani et al., 2014), we computed weighted admixture LD curves using the ALDER (Loh et al., 2013) package and the HAPMAP recombination map to characterise the sources and timing of gene-flow events. Specifically, we estimated the y-axis intercept (amplitude) of weighted LD curves for each target population using curves from an analysis where one of the sources was the target population (self reference) and the other was, separately, each of the other (non-self reference) populations. Theory predicts that the amplitude of these 'one-reference' curves becomes larger the more similar the non-self reference population is to the true admixing source (Loh et al., 2013). As with the f3 analysis outlined above, for many of the sub-Saharan African populations, Eurasian and hunter-gatherer groups (such as the Ju/’hoansi) produced the largest amplitudes (Figure 3—figure supplement 1 and Figure 3—figure supplement 2), reinforcing the contribution of these ancestries to our broad set of African populations. We investigated the evidence for more complex admixture using MALDER (Pickrell et al., 2014), an implementation of ALDER which fits a mixture of exponentials to weighted LD curves to infer multiple admixture events (Figure 3 and Figure 3—source data 1). In Figure 3A, for each target population, we show the ancestry region of the two populations with the greatest MALDER curve amplitudes, together with the date of admixture, for at most two events. Throughout, we convert time since admixture in generations to a date by assuming a generation time of 29 years (Fenner, 2005). 
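The relationship between the decay of admixture LD and the date of admixture can be sketched as follows. This is a simplified illustration, not the ALDER or MALDER implementation: it assumes a single admixture pulse, a noise-free curve of the form A·exp(-n·d) with genetic distance d in Morgans and n generations since admixture, and no constant offset term. All numerical values and function names are hypothetical; only the 29-year generation time comes from the text.

```python
import math
from typing import List, Sequence, Tuple

GENERATION_YEARS = 29  # generation time assumed in the text (Fenner, 2005)


def expected_weighted_ld(d_morgans: float, amplitude: float,
                         generations: float) -> float:
    """Single-pulse decay curve: amplitude * exp(-generations * distance)."""
    return amplitude * math.exp(-generations * d_morgans)


def fit_decay(distances: Sequence[float],
              ld_values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of log(LD) = log(A) - n * d; returns (A, n)."""
    logs = [math.log(v) for v in ld_values]
    n_points = len(distances)
    mean_d = sum(distances) / n_points
    mean_log = sum(logs) / n_points
    sxx = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((d - mean_d) * (l - mean_log) for d, l in zip(distances, logs))
    slope = sxy / sxx
    intercept = mean_log - slope * mean_d
    return math.exp(intercept), -slope


# Hypothetical curve: amplitude 0.002, admixture 60 generations ago,
# sampled every 0.5 cM from 0.5 cM to 20 cM.
distances: List[float] = [0.005 * k for k in range(1, 41)]
curve = [expected_weighted_ld(d, 0.002, 60) for d in distances]

amplitude_hat, generations_hat = fit_decay(distances, curve)
years_ago = generations_hat * GENERATION_YEARS
```

The fitted intercept plays the role of the curve amplitude discussed above, and the fitted decay rate gives the number of generations, which is multiplied by the generation time to obtain years before sampling. Real data are noisy, so a log-linear fit like this one would be a poor substitute for the model fitting and jackknife error estimates that dedicated software provides.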
We note that the inferred admixture dates indicate when gene-flow occurred between populations and not the arrival of groups into an area, which may often be several generations earlier.

Figure 3. Inference of admixture in sub-Saharan Africa using MALDER. We used MALDER to identify the evidence for multiple waves of admixture in each population. (A) For each population, we show the ancestry region identity of the two populations involved in generating the MALDER curves with the greatest amplitudes (coloured blocks) for at most two events. The major contributing sources are highlighted with a black box. Populations are ordered by ancestry of the admixture sources and date estimates, which are shown ± 1.96 × s.e. For each event we compared the MALDER curves with the greatest amplitude to other curves involving populations from different ancestry regions. In the central panel, for each source, we highlight the ancestry regions providing curves that are not significantly different from the best curves. In the Jola, for example, this analysis shows that, although the curve with the greatest amplitude is given by Khoesan (green) and Eurasian (dark yellow) populations, curves containing populations from any other African group (apart from Afroasiatic) in place of a Khoesan population are not significantly smaller than this best curve (SOURCE 1). Conversely, when comparing curves where a Eurasian population is substituted with a population from another group, all curve amplitudes are significantly smaller (Z>2). (B) Comparison of dates of admixture ± 1.96 × s.e. for MALDER dates inferred using the HAPMAP recombination map and a recombination map inferred from European (CEU) individuals from Hinch et al. (2011). We only show comparisons for dates where the same number of events were inferred using both methods. Point symbols refer to populations and are as in Figure 1. (C) As (B), but the comparison uses an African (YRI) map. Source data can be found in Figure 3—source data 1.
https://doi.org/10.7554/eLife.15266.014

Figure 3—source data 1. The evidence for multiple waves of admixture in African populations using MALDER and the HAPMAP recombination map. For each event in each ethnic group we show the largest inferred amplitude and date of an admixture event involving two reference populations (Pop1 and Pop2). We additionally provide the ancestry region identity of the two main reference populations, together with Z scores for curve comparisons between this best curve and those containing populations from different ancestry regions. We use a cut-off of Z < 2 to decide whether sources from multiple ancestries best describe the admixture source.
https://doi.org/10.7554/eLife.15266.015

Figure 3—source data 2 The evidence for multiple waves of admixture in African populations using MAL
