UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts.

Alex Diaz-Papkovich, Simon Gravel, Chief Ben-Eghan, Luke Anderson-Trocmé

doi:10.1371/journal.pgen.1008432

Abstract

Human populations feature both discrete and continuous patterns of variation. Current analysis approaches struggle to jointly identify these patterns because of modelling assumptions, mathematical constraints, or numerical challenges. Here we apply uniform manifold approximation and projection (UMAP), a non-linear dimension reduction tool, to three well-studied genotype datasets and discover overlooked subpopulations within the American Hispanic population, fine-scale relationships between geography, genotypes, and phenotypes in the UK population, and cryptic structure in the Thousand Genomes Project data. This approach is well-suited to the influx of large and diverse data and opens new lines of inquiry in population-scale datasets.

Highlights

Questions in medicine, anthropology, and related fields hinge on interpreting the deluge of genomic data provided by modern high-throughput sequencing technologies.
Uniform manifold approximation and projection (UMAP) is computationally efficient and can identify fine-scale population structure in large population datasets. We find it useful for revealing phenotypic variation among genetically related populations, and recommend it as a complement to principal component analysis in primary data visualization.
Applying UMAP and t-distributed stochastic neighbour embedding (t-SNE) to the genotype data produces clusters that are roughly grouped by continent. UMAP shows a clear hierarchy of population and continental clusters, whereas t-SNE fails to assign many individuals to population clusters.

Summary

Introduction

Questions in medicine, anthropology, and related fields hinge on interpreting the deluge of genomic data provided by modern high-throughput sequencing technologies. Because genomic datasets are high-dimensional, their interpretation requires statistical methods that condense information into a form researchers can understand while minimizing the amount of data sacrificed. Both model-based and model-agnostic approaches to summarizing data have played important roles in shaping our understanding of the evolution of our species [e.g., 1,2,3,4,5]. Since each individual is represented as an L-dimensional vector, dimension reduction methods are needed to visualize the data.
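To make the idea of reducing L-dimensional individuals to a plottable form concrete, here is a minimal sketch of principal component analysis, the linear method the highlights name as UMAP's complement. It is not the authors' pipeline and it is not UMAP, which needs a dedicated library. The toy genotype matrix (rows are individuals, columns are assumed to be variant sites coded as 0, 1, or 2 allele copies) and all function names are hypothetical, chosen only for illustration.

```python
import math
import random


def center_columns(genotypes):
    """Subtract each site's mean genotype so every column has mean zero."""
    n = len(genotypes)
    n_sites = len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n for j in range(n_sites)]
    return [[row[j] - means[j] for j in range(n_sites)] for row in genotypes]


def project(rows, axis):
    """Score of every individual along one axis (dot product per row)."""
    return [sum(x * w for x, w in zip(row, axis)) for row in rows]


def top_axis(rows, n_iter=200, seed=0):
    """Leading principal axis by power iteration on X^T X (never formed)."""
    rng = random.Random(seed)
    n_sites = len(rows[0])
    v = [rng.gauss(0, 1) for _ in range(n_sites)]
    for _ in range(n_iter):
        scores = project(rows, v)  # X v
        v = [sum(s * row[j] for s, row in zip(scores, rows))
             for j in range(n_sites)]  # X^T (X v)
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v


def deflate(rows, axis):
    """Remove each row's component along `axis` so the next axis is orthogonal."""
    scores = project(rows, axis)
    return [[x - s * w for x, w in zip(row, axis)]
            for s, row in zip(scores, rows)]


def pca_2d(genotypes):
    """Return (PC1, PC2) coordinates for every individual."""
    x = center_columns(genotypes)
    axis1 = top_axis(x)
    pc1 = project(x, axis1)
    x_rest = deflate(x, axis1)
    axis2 = top_axis(x_rest, seed=1)
    pc2 = project(x_rest, axis2)
    return list(zip(pc1, pc2))


# Hypothetical toy data: 6 individuals, L = 5 sites, two made-up groups.
toy = [
    [0, 0, 1, 2, 2],
    [0, 1, 1, 2, 2],
    [0, 0, 0, 2, 1],
    [2, 2, 1, 0, 0],
    [2, 1, 1, 0, 0],
    [2, 2, 2, 0, 1],
]
coords = pca_2d(toy)
for i, (a, b) in enumerate(coords):
    print(f"individual {i}: PC1={a:+.3f} PC2={b:+.3f}")
```

In this toy case the first three individuals fall on one side of PC1 and the last three on the other, so the two made-up groups separate along a single axis. Real cohorts have far more sites and individuals, and a non-linear method such as UMAP could then be run on the genotypes or on leading components like these using an existing implementation.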



Journal: PLOS Genetics
Publication Date: Nov 1, 2019
Citations: 182
License type: CC BY 4.0


