Interpreting Supervised Machine Learning Inferences in Population Genomics Using Haplotype Matrix Permutations

Abstract

Supervised machine learning methods, such as convolutional neural networks (CNNs), that use haplotype matrices as input data have become powerful tools for population genomics inference. However, these methods often lack interpretability, making it difficult to understand which population genetics features drive their predictions. This is a critical limitation for both method development and biological interpretation. Here, we introduce a systematic permutation approach that progressively disrupts population genetics features within input test haplotype matrices, including linkage disequilibrium, haplotype structure, and allele frequencies. Measuring the performance degradation after each permutation lets us assess the importance of each feature. We applied our approach to three published CNNs for positive selection and demographic history inference. We found that the positive selection inference CNN ImaGene critically depends on haplotype structure and linkage disequilibrium patterns, while the demographic inference CNN relies primarily on allele frequency information. Surprisingly, another positive selection inference CNN, disc-pg-gan, achieved high accuracy using only simple allele count information, suggesting its training regime may not adequately challenge the model to learn complex population genetic signatures. Our approach provides a straightforward, model-agnostic, and biologically motivated framework for interpreting any haplotype matrix-based method, offering insights that can guide both method development and application in population genomics.
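To make the idea of progressive permutation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a haplotype matrix stored as a binary NumPy array with haplotypes as rows and sites as columns. The three permutation functions, the `accuracy_drop` helper, and the `predict` callable are hypothetical names chosen for illustration; the abstract does not specify the exact permutations used.

```python
import numpy as np

def permute_within_sites(hap, rng):
    """Shuffle alleles among haplotypes independently at each site (column).

    Per-site allele counts are preserved; associations between sites
    (linkage disequilibrium, haplotype structure) are broken.
    """
    return rng.permuted(hap, axis=0)

def permute_site_order(hap, rng):
    """Shuffle the order of sites (columns) with one shared permutation.

    Each haplotype keeps its alleles and each site keeps its allele count,
    but the positional arrangement of sites is lost.
    """
    return hap[:, rng.permutation(hap.shape[1])]

def permute_all_entries(hap, rng):
    """Shuffle every entry of the matrix.

    Only the total number of derived alleles in the matrix is preserved.
    """
    return rng.permuted(hap.ravel()).reshape(hap.shape)

def accuracy_drop(predict, haps, labels, permute, rng):
    """Accuracy on the original test set minus accuracy after permutation.

    `predict` is an assumed callable mapping a stack of matrices with shape
    (n_examples, n_haplotypes, n_sites) to an array of predicted labels.
    """
    baseline = np.mean(predict(haps) == labels)
    shuffled = np.stack([permute(h, rng) for h in haps])
    return baseline - np.mean(predict(shuffled) == labels)
```

Under these assumptions, a large accuracy drop after `permute_within_sites` would suggest the model relies on linkage or haplotype structure, whereas a model whose accuracy survives `permute_all_entries` is likely using little more than overall allele counts.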
