Articles published on Statistical genetics
372 Search results
- Research Article
- 10.1101/gr.280659.125
- Dec 23, 2025
- Genome research
- Ruhollah Shemirani + 6 more
Population structure is a well-known confounder in statistical genetics, particularly in genome-wide association studies (GWAS), where it can lead to inflated test statistics and spurious associations. Traditional methods, such as principal components (PCs), commonly used to adjust for population structure, are limited in capturing fine-scale, nonlinear patterns that arise from recent demographic events, patterns that are crucial for understanding rare variant effects. To address this challenge, we propose a novel method called SPectral Components (SPCs), which leverages identity-by-descent (IBD) graphs to capture and transform local, nonlinear fine-scale population structure into continuous representations that can be seamlessly integrated into genetic analysis pipelines. Using both simulated datasets and empirical data from the UK Biobank (N ≈ 420,000), we demonstrate that SPCs outperform PCs in adjusting for fine-scale population structure. In simulations, SPCs explained over 90% of the fine-scale population structure with fewer components, while PCs captured less than 5%. In the UK Biobank, SPCs reduced the inflation of P-values in the GWAS of an environmentally driven phenotype by 12% compared to PCs, while maintaining a similar performance to PCs in height, a highly heritable phenotype. Additionally, SPCs improved rare variant association analyses, reducing genomic inflation (e.g., from 7.6 to 1.2 in one analysis), and provided more accurate heritability estimates. Spatial autocorrelation analysis further confirmed the ability of SPCs to account for environmental effects, reducing Moran's I for both environmental and heritable phenotypes more effectively than PCs. Overall, our findings demonstrate that SPCs provide a robust, scalable adjustment for recent population structure, offering a powerful alternative or complement to PCs in large-scale biobank studies.
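The "genomic inflation" figures quoted in the abstract above are commonly summarized by the genomic inflation factor, the median observed association chi-square statistic divided by its expected median under the null. The following is a minimal sketch of that conventional calculation, not code from the paper; the function name and the simulated z-scores are assumptions made for illustration.

```python
import numpy as np

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation(z_scores):
    """Return lambda_GC: median(z^2) divided by the null chi-square(1) median."""
    chi2 = np.asarray(z_scores, dtype=float) ** 2
    return float(np.median(chi2) / CHI2_1DF_MEDIAN)

# Hypothetical data: well-calibrated z-scores, and the same scores with
# their chi-square statistics uniformly inflated by a factor of 1.2.
rng = np.random.default_rng(1)
null_z = rng.normal(size=100_000)
inflated_z = null_z * np.sqrt(1.2)

lam_null = genomic_inflation(null_z)
lam_infl = genomic_inflation(inflated_z)
print(f"lambda (null)     = {lam_null:.3f}")
print(f"lambda (inflated) = {lam_infl:.3f}")
```

A value near 1 indicates well-calibrated test statistics; values well above 1 suggest residual confounding such as uncorrected population structure, although polygenicity can also raise the statistic.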
- Research Article
- 10.11648/j.cbb.20251302.13
- Dec 11, 2025
- Computational Biology and Bioinformatics
- Tanmay Bandbe + 2 more
Anaplastic lymphoma kinase (ALK) has been linked to several hematological malignancies; however, its comprehensive genetic variability and potential disease associations are not fully understood. In this study, a structure-guided genome-wide association analysis (GWAS) of ALK variants was performed using publicly available summary statistics and R-based analytical pipelines. The GWAS datasets were acquired, filtered, and ranked based on sample size to ensure sufficient statistical power. A focused analysis was performed on two distinct datasets, selected based on sample size and phenotypic diversity: one representing lymphoma-related genetic traits from the UK Biobank, and another capturing ALK-associated proteomic variation. Rigorous quality control and comprehensive data visualization were performed using a set of diagnostic and analytical plots, including volcano plots, QQ plots, histograms, effect size plots, and a correlation matrix heatmap of numerical variables. Regional Manhattan plots highlighted distinct, highly significant associations at the ALK locus in both datasets, enabling the identification of independent lead variants. Interpretation of the QQ plots and histograms confirmed adequate control for population stratification and minimal inflation of test statistics. Integration of insights from the effect size distribution and SE versus Beta plots provided a clear assessment of the precision and reliability of estimated genetic effects. By mapping genetic variants onto the ALK protein structure and evaluating their associations with disease phenotypes across populations, single-nucleotide polymorphisms (SNPs) with potential functional relevance were prioritized. This strategy facilitates the identification of variants likely to influence protein structure and function, thereby enhancing the interpretability of GWAS findings in a protein-centric context.
This approach demonstrates the power of integrating structural bioinformatics with statistical genetics to reveal novel genotype-phenotype relationships, offering valuable insights for precision medicine and targeted ALK-directed therapies. Overall, this integrative methodology establishes a reproducible framework for detailed regional GWAS analyses, successfully pinpointing strong ALK locus associations and identifying candidate variants for subsequent functional validation relevant to the phenotypes, and assessing their potential role in therapeutic investigation for hematological malignancies.
- Research Article
- 10.1016/j.ajhg.2025.10.016
- Dec 1, 2025
- American journal of human genetics
- Diane Xue + 13 more
Training competencies and recommendations for the next generation of public health genetics: Reflections from current leaders in the field.
- Research Article
- 10.1016/j.ajhg.2025.10.005
- Dec 1, 2025
- American journal of human genetics
- Alejandro Mejia-Garcia + 14 more
Using the ancestral recombination graph to study the history of rare variants in founder populations.
- Research Article
- 10.1038/s41598-025-19697-x
- Oct 14, 2025
- Scientific Reports
- Zhonghai Wang + 2 more
Primary biliary cholangitis (PBC) may affect skeletal muscles through the muscle-liver axis, subsequently leading to sarcopenia. Our study aims to explore the unclear genetic relationships between PBC and sarcopenia. We investigated the shared genetic architecture of PBC and sarcopenia using advanced statistical genetics methods and genome-wide association summary data. We employed global and local genetic correlation to gain insight into potential shared biological mechanisms. We identified risk single nucleotide polymorphisms (SNPs) and functionally annotated genomic multi-markers by conducting the unified test for molecular signatures. Finally, we prioritized fine-mapping analysis to emphasize the significant causal genes. Our study has identified significant genomic associations, suggesting complex genetic interactions between PBC and sarcopenia. At the genomic level, we identified 17 unique bivariate regions among 88 trait pairs. In the bivariate locus analysis, we identified a total of 136 pleiotropic loci, with ASTN1, TGFB2, and ACP1 being particularly prominent. Functional enrichment analysis highlighted putative pleiotropic genomic regions, including brain and spleen. Furthermore, the identified pleiotropic loci demonstrate strong signal transduction in the cGMP–PKG signaling pathway. Our findings highlight shared genetic links and causal relationships between PBC and sarcopenia, offering novel insights into their genetic mechanisms.
- Research Article
- 10.1101/2025.06.04.25328990
- Sep 3, 2025
- medRxiv
- Ruhollah Shemirani + 6 more
Population structure is a well-known confounder in statistical genetics, particularly in genome-wide association studies (GWAS), where it can lead to inflated test statistics and spurious associations. Traditional methods, such as principal components (PCs), commonly used to adjust for population structure, are limited in capturing fine-scale, non-linear patterns that arise from recent demographic events, patterns that are crucial for understanding rare variant effects. To address this challenge, we propose a novel method called SPectral Components (SPCs), which leverages identity-by-descent (IBD) graphs to capture and transform local, non-linear fine-scale population structure into continuous representations that can be seamlessly integrated into genetic analysis pipelines. Using both simulated datasets and empirical data from the UK Biobank (N ≈ 420,000), we demonstrate that SPCs outperform PCs in adjusting for fine-scale population structure. In simulations, SPCs explained over 90% of the fine-scale population structure with fewer components, while PCs captured less than 5%. In the UK Biobank, SPCs reduced the inflation of p-values in the GWAS of an environmentally driven phenotype by 12% compared to PCs, while maintaining a similar performance to PCs in height, a highly heritable phenotype. Additionally, SPCs improved rare variant association analyses, reducing genomic inflation (e.g., from 7.6 to 1.2 in one analysis), and provided more accurate heritability estimates. Spatial autocorrelation analysis further confirmed the ability of SPCs to account for environmental effects, reducing Moran’s I for both environmental and heritable phenotypes more effectively than PCs. Overall, our findings demonstrate that SPCs provide a robust, scalable adjustment for recent population structure, offering a powerful alternative or complement to PCs in large-scale biobank studies.
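To make the idea of turning a sharing graph into continuous covariates more concrete, here is a generic spectral embedding of a weighted graph using the eigenvectors of its symmetric normalized Laplacian. This is a textbook construction offered as an illustration only; the abstract does not give the SPC algorithm, so this should not be read as the authors' method. The toy weight matrix `W` (standing in for pairwise IBD sharing) and the function name are hypothetical.

```python
import numpy as np

def spectral_embedding(W, k):
    """Embed graph nodes using eigenvectors of the normalized Laplacian.

    W is a symmetric, non-negative weight matrix in which every node has
    at least one edge (so all degrees are positive). Returns the k
    eigenvectors that follow the trivial one, as an (n, k) array.
    """
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # L_sym = I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    evals, evecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    return evecs[:, 1:k + 1]          # skip the trivial eigenvector

# Hypothetical sharing graph: two tight clusters joined by one weak edge.
W = np.array([
    [0, 5, 5, 0, 0, 0],
    [5, 0, 5, 0, 0, 0],
    [5, 5, 0, 1, 0, 0],
    [0, 0, 1, 0, 5, 5],
    [0, 0, 0, 5, 0, 5],
    [0, 0, 0, 5, 5, 0],
], dtype=float)

coords = spectral_embedding(W, k=1)
print(coords[:, 0])
```

In this toy graph the first non-trivial coordinate takes one sign in the first cluster and the opposite sign in the second, which is the kind of continuous axis that could be added as a covariate in an association model.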
- Research Article
- 10.1093/bioadv/vbaf205
- Aug 26, 2025
- Bioinformatics Advances
- Drew Dehaas + 1 more
Motivation: While there are a variety of file formats for storing reference-sequence-aligned genotype data, many are complex or inefficient. Programming language support for such formats is often limited. A file format that is simple to understand and implement, yet fast and small, is helpful for research on highly scalable statistical and population genetics methods. Results: We present the Indexable Genotype Data (IGD) file format, a simple uncompressed binary format that can be more than 100× faster and 3.5× smaller than vcf.gz on biobank-scale whole-genome sequence data. The implementation for reading and writing IGD in Python is under 350 lines of code, which reflects the simplicity of the format. Availability and implementation: A C++ library for reading and writing IGD, and tooling to convert .vcf.gz files, can be found at https://github.com/aprilweilab/picovcf. A Python library is at https://github.com/aprilweilab/pyigd.
- Research Article
- 10.1101/2025.08.15.670378
- Aug 20, 2025
- bioRxiv
- Aditya Syam + 2 more
Motivation: The Genotype Representation Graph (GRG) [DeHaas et al., 2025] is a graph representation of whole genome polymorphisms, designed to encode the variant hard-call information in phased whole genomes. It encodes the genotypes as an extremely compact graph that can be traversed efficiently, enabling dynamic programming-style algorithms on applications such as genome-wide association studies that run faster on biobank-scale data than existing alternatives. To facilitate scalable statistical genetics, we present GrgPhenoSim, an extremely fast phenotype simulator for GRGs, suitable for simulating phenotypes on biobank-scale datasets. Results: GrgPhenoSim contains all the primary functionalities of a phenotype simulator, uses a standardized output, and supports customized simulations. GrgPhenoSim is dozens to hundreds of times faster than tstrait [Tagami et al., 2024], a fast ancestral recombination graph-based phenotype simulator, when the sample size ranges from thousands to hundreds of thousands of samples. Availability: The GrgPhenoSim library and use-case demonstrations are available at https://github.com/aprilweilab/grg_pheno_sim. The documentation for GrgPhenoSim is hosted at https://grgl.readthedocs.io/en/latest/index.html
- Research Article
- 10.1101/2024.11.11.24317065
- Aug 13, 2025
- medRxiv
- Nikhil Milind + 5 more
The genome-wide burdens of deletions, loss-of-function mutations, and duplications correlate with many traits. Curiously, for most of these traits, variants that decrease expression have the same genome-wide average direction of effect as variants that increase expression. This seemingly contradicts the intuition that for individual genes reducing expression should have the opposite effect on a phenotype as increasing expression. To understand this paradox, we use the gene dosage response curve (GDRC), which relates changes in gene expression to expected changes in phenotype. We show that, for many traits, GDRCs are systematically biased in one trait direction relative to the other, and we develop a simple theoretical model that explains this bias in trait direction. Our results have broad implications for complex traits, drug discovery, and statistical genetics.
- Research Article
- 10.1038/s41467-025-61712-2
- Aug 7, 2025
- Nature communications
- Shilpa Nadimpalli Kobren + 17 more
Genomics for rare disease diagnosis has advanced at a rapid pace due to our ability to perform in-depth analyses on individual patients with ultra-rare diseases. The increasing sizes of ultra-rare disease cohorts internationally now enable cohort-wide analyses for new discoveries, but well-calibrated statistical genetics approaches for jointly analyzing these patients are still under development. The Undiagnosed Diseases Network (UDN) brings multiple clinical, research and experimental centers under the same umbrella across the United States to facilitate and scale case-based diagnostic analyses. Here, we present the first joint analysis of whole genome sequencing data of UDN patients across the network. We introduce new, well-calibrated statistical methods for prioritizing disease genes with de novo recurrence and compound heterozygosity. We also detect pathways enriched with candidate and known diagnostic genes. Our computational analysis, coupled with a systematic clinical review, recapitulated known diagnoses and revealed new disease associations. We further release a software package, RaMeDiES, enabling automated cross-analysis of deidentified sequenced cohorts for new diagnostic and research discoveries. Gene-level findings and variant-level information across the cohort are available in a public-facing browser (https://dbmi-bgm.github.io/udn-browser/). These results show that case-level diagnostic efforts should be supplemented by a joint genomic analysis across cohorts.
- Research Article
- 10.1101/2025.07.10.664154
- Jul 15, 2025
- bioRxiv
- Luke J O’Connor + 1 more
The ‘polygenicity’ of traits is often invoked and sometimes quantified in quantitative, statistical, and human genetics. What do we mean by the polygenicity of a trait? We propose a principled definition that encompasses a range of polygenicity measures. We show that these measures satisfy certain mathematical properties, we argue that these properties are sensible if not necessary, and we show that, conversely, measures that satisfy these properties also satisfy our definition. We consider four specific measures in greater detail, describe how they differ and show that three of them can be estimated from GWAS summary statistics using an existing method, Fourier Mixture Regression. We estimate these measures for 36 traits in humans. We find a dearth of traits with polygenicity values that fall within the large gap between Mendelian and highly polygenic traits. We discuss the evolutionary and cellular processes underlying trait polygenicity.
- Research Article
- 10.3389/fimmu.2025.1543781
- May 8, 2025
- Frontiers in immunology
- Vera Fominykh + 10 more
Based on clinical, biomarker, and genetic data, McGonagle and McDermott suggested that autoimmune and autoinflammatory disorders can be classified as a disease continuum from purely autoimmune to autoinflammatory with mixed diseases in between. However, the genetic architecture of this spectrum has not been systematically described. Here, we investigate the continuum of polygenic immune-mediated disorders using genome-wide association studies (GWAS) and statistical genetics methods. We mapped the genetic landscape of 15 immune-mediated disorders using GWAS summary statistics and methods including genomic structural equation modeling (genomic SEM), linkage disequilibrium score regression, Local Analysis of [co]Variant Association, and Gaussian causal mixture modeling (MiXeR). We performed enrichment analyses of tissues and biological gene sets using MAGMA. Genomic SEM suggested a continuum structure with four underlying latent factors, from autoimmune diseases at one end to autoinflammatory diseases at the other. Across disorders, we observed a balanced mixture of negative and positive local genetic correlations within the major histocompatibility complex, while outside this region, local genetic correlations were predominantly positive. MiXeR analysis showed large genetic overlap in accordance with the continuum landscape. MAGMA analysis implicated genes associated with known monogenic immune diseases for the prominent autoimmune and autoinflammatory components. Our findings support a polygenic continuum across immune-mediated disorders, with four genetic clusters. The "polygenic autoimmune" and "polygenic autoinflammatory" clusters reside on the margins of this continuum. These findings provide insights and lead us to hypothesize that the identified clusters could inform future therapeutic strategies, with patients in the same clusters potentially responding similarly to specific therapies.
- Research Article
- 10.1093/genetics/iyaf071
- Apr 15, 2025
- Genetics
- Jennifer Blanc + 1 more
Polygenic scores have become an important tool in human genetics, enabling the prediction of individuals' phenotypes from their genotypes. Understanding how the pattern of differences in polygenic score predictions across individuals intersects with variation in ancestry can provide insights into the evolutionary forces acting on the trait in question and is important for understanding health disparities. However, because most polygenic scores are computed using effect estimates from population samples, they are susceptible to confounding by both genetic and environmental effects that are correlated with ancestry. The extent to which this confounding drives patterns in the distribution of polygenic scores depends on the patterns of population structure in both the original estimation panel and in the prediction/test panel. Here, we use theory from population and statistical genetics, together with simulations, to study the procedure of testing for an association between polygenic scores and axes of ancestry variation in the presence of confounding. We use a general model of genetic relatedness to describe how confounding in the estimation panel biases the distribution of polygenic scores in ways that depend on the degree of overlap in population structure between panels. We then show how this confounding can bias tests for associations between polygenic scores and important axes of ancestry variation in the test panel. Specifically, for any given test, there exists a single axis of population structure in the genome-wide association study (GWAS) panel that needs to be controlled for in order to protect the test. In the context of this result, we study the behavior of multiple approaches to control for stratification along this axis, including standard methods such as using principal components as fixed covariates in the GWAS, linear mixed models, and a novel approach for directly estimating the axis using the test panel genotypes.
Our analyses highlight the role of estimation noise in the models of population structure as a plausible source of residual confounding in polygenic score analyses.
- Research Article
- 10.1007/s10681-025-03500-z
- Apr 9, 2025
- Euphytica
- João Marcos Amario De Sousa + 6 more
Statistical genetics models with residual and genetic structures enhance the accuracy of selecting wheat populations
- Research Article
- 10.1038/s41467-025-56884-w
- Mar 3, 2025
- Nature Communications
- Edoardo Bertolini + 7 more
An early event in plant organogenesis is establishment of a boundary between the stem cell containing meristem and differentiating lateral organ. In maize (Zea mays), evidence suggests a common gene network functions at boundaries of distinct organs and contributes to pleiotropy between leaf angle and tassel branch number, two agronomic traits. To uncover regulatory variation at the nexus of these two traits, we use regulatory network topologies derived from specific developmental contexts to guide multivariate genome-wide association analyses. In addition to defining network plasticity around core pleiotropic loci, we identify new transcription factors that contribute to phenotypic variation in canopy architecture, and structural variation that contributes to cis-regulatory control of pleiotropy between tassel branching and leaf angle across maize diversity. Results demonstrate the power of informing statistical genetics with context-specific developmental networks to pinpoint pleiotropic loci and their cis-regulatory components, which can be used to fine-tune plant architecture for crop improvement.
- Research Article
- 10.1101/2025.02.03.636375
- Feb 5, 2025
- bioRxiv : the preprint server for biology
- Qinwen Zheng + 16 more
Previous genetic studies of human assortative mating have primarily focused on searching for its genomic footprint but have revealed limited insights into its biological and social mechanisms. Combining insights from the economics of the marriage market with advanced tools in statistical genetics, we perform the first genome-wide association study (GWAS) on a latent index for partner choice. Using 206,617 individuals from four global cohorts, we uncover phenotypic characteristics and social processes underlying assortative mating. We identify a broadly robust genetic component of the partner choice index between sexes and several countries and identify its genetic correlates. We also provide solutions to reduce assortative mating-driven biases in genetic studies of complex traits by conditioning GWAS summary statistics on the genetic associations with the latent partner choice index.
- Research Article
- 10.1002/agj2.70024
- Jan 1, 2025
- Agronomy Journal
- Rafael Tobias Lang Fronza + 5 more
Few studies have investigated the effect on the genotypic value of wheat (Triticum aestivum L.) families with the adoption of the additive and epistatic (additive × additive) relationship matrix. The objective of this study is to select F2:3 families of wheat by means of three statistical genetics models (without pedigree information, additive, and additive plus additive × additive epistatic) and to evaluate the selection rank between the traditional model and the model with best fit of families for recombination and for deriving progenies. The experiment was composed of a total of 880 F2:3 families of tropical wheat, from 56 populations conducted by the genealogical method, which came from a full diallel involving the cultivars BRS 254, BRS 264, BRS 394, CD 1303, Tbio Aton, Tbio Ponteiro, Tbio Duque, and Tbio Sossego. The pedigree matrix was calculated, obtaining approximately 20 generations of ancestry of the parents. The data were analyzed in three genetic‐statistical models: Model 1, without information on family relationship; Model 2, computing the additive relationship matrix; and Model 3, including the additive and epistatic (additive × additive) relationship matrix. Using the additive and epistatic (additive × additive) pedigree matrix has a significant effect on most traits. The selection revealed families of populations with potential to be used in recombinations (BRS 254/CD 1303, Tbio Ponteiro/BRS 394, and BRS 394/Tbio Ponteiro), with genetic value to derive progenies (BRS 254/Tbio Aton, Tbio Aton/Tbio Duque, and BRS 394/Tbio Aton), and with both attributes (BRS 254/CD 1303, BRS 394/Tbio Ponteiro, and Tbio Sossego/BRS 264).
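As an illustration of the two relationship matrices named in Models 2 and 3, the sketch below builds an additive (numerator) relationship matrix from a small pedigree with the tabular method and then forms an additive × additive matrix as its element-wise (Hadamard) product, a common textbook approximation. The abstract does not say how the authors computed either matrix, so this is an assumption, and the five-individual pedigree is hypothetical rather than taken from the wheat data.

```python
import numpy as np

def additive_relationship(pedigree):
    """Tabular method for the additive relationship matrix A.

    pedigree is a list of (sire, dam) index pairs, with None for an
    unknown parent; parents must appear before their offspring.
    """
    n = len(pedigree)
    A = np.zeros((n, n))
    for i, (sire, dam) in enumerate(pedigree):
        for j in range(i):
            a = 0.0
            if sire is not None:
                a += 0.5 * A[j, sire]
            if dam is not None:
                a += 0.5 * A[j, dam]
            A[i, j] = A[j, i] = a
        # Diagonal is 1 plus the inbreeding coefficient of individual i.
        both_known = sire is not None and dam is not None
        A[i, i] = 1.0 + (0.5 * A[sire, dam] if both_known else 0.0)
    return A

# Hypothetical pedigree: two founders, two full sibs, and their offspring.
pedigree = [(None, None), (None, None), (0, 1), (0, 1), (2, 3)]

A = additive_relationship(pedigree)
AA = A * A  # additive x additive matrix as the Hadamard product of A

print(A)
print(AA)
```

Here the full sibs (indices 2 and 3) have an additive relationship of 0.5 and an additive × additive relationship of 0.25, and the offspring of the sib mating has a diagonal of 1.25 in A, reflecting an inbreeding coefficient of 0.25.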
- Research Article
- 10.1016/j.gimo.2025.103467
- Jan 1, 2025
- Genetics in Medicine Open
- Nicole Zeltser + 5 more
ApplyPolygenicScore: An R package for applying polygenic risk score models
- Research Article
- 10.1080/03610926.2024.2413853
- Nov 5, 2024
- Communications in Statistics - Theory and Methods
- Guoxin Qiu + 2 more
Ranked set sampling (RSS) has gained considerable importance in numerous practical domains, such as environmental and ecological studies, and statistical genetics. This article focuses on investigating the concept of extropy as a measure of residual uncertainty in RSS. It is devoted to discussing various monotone properties and characterization results associated with the residual extropy of RSS. Additionally, a comparative analysis is conducted between the residual extropy of RSS and its counterpart in simple random sampling. Optimal minima and maxima values for the residual extropy of RSS are determined. To further contribute to the field, a consistent estimator for the residual extropy of RSS is proposed. The effectiveness of this estimator is demonstrated through three illustrative examples, highlighting its performance in practical scenarios.
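For readers unfamiliar with the quantities named above, the usual definitions of extropy and residual extropy for a single continuous random variable are sketched below. The abstract itself does not state them, so the notation (density f, survival function \bar{F}, threshold t) is an assumption, and the RSS versions studied in the article are built from these single-variable forms in ways not reproduced here.

```latex
% Extropy of a continuous random variable X with density f:
J(X) = -\frac{1}{2} \int_{-\infty}^{\infty} f^{2}(x)\, dx

% Residual extropy of a non-negative lifetime X at time t, where
% \bar{F}(t) = P(X > t) is the survival function:
J(X; t) = -\frac{1}{2\,\bar{F}^{2}(t)} \int_{t}^{\infty} f^{2}(x)\, dx

% Worked example: X exponential with rate \lambda,
% f(x) = \lambda e^{-\lambda x}, \ \bar{F}(t) = e^{-\lambda t}:
\int_{t}^{\infty} \lambda^{2} e^{-2\lambda x}\, dx
  = \frac{\lambda}{2}\, e^{-2\lambda t}
\quad\Longrightarrow\quad
J(X; t) = -\frac{\lambda}{4} \ \text{for all } t \ge 0
```

The exponential case gives a residual extropy that does not depend on t, which matches the memoryless property of that distribution.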
- Research Article
- 10.1371/journal.pbio.3002847
- Oct 9, 2024
- PLoS biology
- Joshua G Schraiber + 2 more
In both statistical genetics and phylogenetics, a major goal is to identify correlations between genetic loci or other aspects of the phenotype or environment and a focal trait. In these 2 fields, there are sophisticated but disparate statistical traditions aimed at these tasks. The disconnect between their respective approaches is becoming untenable as questions in medicine, conservation biology, and evolutionary biology increasingly rely on integrating data from within and among species, and once-clear conceptual divisions are becoming increasingly blurred. To help bridge this divide, we lay out a general model describing the covariance between the genetic contributions to the quantitative phenotypes of different individuals. Taking this approach shows that standard models in both statistical genetics (e.g., genome-wide association studies; GWAS) and phylogenetic comparative biology (e.g., phylogenetic regression) can be interpreted as special cases of this more general quantitative-genetic model. The fact that these models share the same core architecture means that we can build a unified understanding of the strengths and limitations of different methods for controlling for genetic structure when testing for associations. We develop analytical intuition for why and when spurious correlations may occur, and we conduct population-genetic and phylogenetic simulations of quantitative traits. The structural similarity of problems in statistical genetics and phylogenetics enables us to take methodological advances from one field and apply them in the other. We demonstrate this by showing how a standard GWAS technique, namely including both the genetic relatedness matrix (GRM) and its leading eigenvectors (corresponding to the principal components of the genotype matrix) in a regression model, can mitigate spurious correlations in phylogenetic analyses.
As a case study, we re-examine an analysis testing for coevolution of expression levels between genes across a fungal phylogeny and show that including eigenvectors of the covariance matrix as covariates decreases the false positive rate while simultaneously increasing the true positive rate. More generally, this work provides a foundation for more integrative approaches for understanding the genetic architecture of phenotypes and how evolutionary processes shape it.
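The covariate-adjustment idea described in this abstract can be sketched as follows: compute a genetic relatedness matrix from standardized genotypes, take its leading eigenvectors, and include them as fixed covariates when regressing a trait on a marker. This is a simplified illustration under assumed simulated data, not the authors' analysis; it covers only the eigenvector-covariate half and omits the random-effect GRM term of a mixed model. All function names and simulation settings are hypothetical.

```python
import numpy as np

def grm_and_top_eigvecs(G, k):
    """Return the GRM of genotype matrix G (samples x SNPs) and its top-k eigenvectors."""
    Z = (G - G.mean(axis=0)) / G.std(axis=0)   # standardize each SNP
    K = Z @ Z.T / G.shape[1]                   # genetic relatedness matrix
    evals, evecs = np.linalg.eigh(K)           # ascending eigenvalues
    order = np.argsort(evals)[::-1][:k]
    return K, evecs[:, order]

def snp_slope(y, g, covariates=None):
    """Least-squares slope of y on SNP g, optionally adjusting for covariates."""
    cols = [np.ones_like(y), g]
    if covariates is not None:
        cols.extend(covariates.T)
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Hypothetical simulation: two subpopulations with different allele
# frequencies and a trait that differs by population only, so that no
# SNP is causal.
rng = np.random.default_rng(0)
n_per_pop, n_snps = 100, 500
pop = np.repeat([0, 1], n_per_pop)
freq = np.where(pop[:, None] == 0, 0.2, 0.8) * np.ones((1, n_snps))
G = rng.binomial(2, freq)
y = 2.0 * pop + rng.normal(size=pop.size)

K, pcs = grm_and_top_eigvecs(G, k=2)
beta_naive = snp_slope(y, G[:, 0])
beta_adj = snp_slope(y, G[:, 0], covariates=pcs)
print(f"naive slope    = {beta_naive:.3f}")
print(f"adjusted slope = {beta_adj:.3f}")
```

Because the leading eigenvector of the GRM tracks subpopulation membership in this toy setting, the spurious marker effect seen in the naive regression shrinks toward zero once the eigenvectors are included.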