Efficient toolkit implementing best practices for principal component analysis of population genetic data

Michael G B Blum, Keurcien Luu, Russell Schwartz, Bjarni J Vilhjálmsson, John J McGrath, Florian Privé

doi:10.1093/bioinformatics/btaa520

Abstract

Motivation: Principal component analysis (PCA) of genetic data is routinely used to infer ancestry and control for population structure in various genetic analyses. However, conducting PCA analyses can be complicated and has several potential pitfalls. These pitfalls include (i) capturing linkage disequilibrium (LD) structure instead of population structure, (ii) projected PCs that suffer from shrinkage bias, (iii) detecting sample outliers and (iv) uneven population sizes. In this work, we explore these potential issues when using PCA, and present efficient solutions to these. Following applications to the UK Biobank and the 1000 Genomes project datasets, we make recommendations for best practices and provide efficient and user-friendly implementations of the proposed solutions in R packages bigsnpr and bigutilsr.

Results: For example, we find that PC19–PC40 in the UK Biobank capture complex LD structure rather than population structure. Using our automatic algorithm for removing long-range LD regions, we recover 16 PCs that capture population structure only. Therefore, we recommend using only 16–18 PCs from the UK Biobank to account for population structure confounding. We also show how to use PCA to restrict analyses to individuals of homogeneous ancestry. Finally, when projecting individual genotypes onto the PCA computed from the 1000 Genomes project data, we find a shrinkage bias that becomes large for PC5 and beyond. We then demonstrate how to obtain unbiased projections efficiently using bigsnpr. Overall, we believe this work would be of interest for anyone using PCA in their analyses of genetic data, as well as for other omics data.

Availability and implementation: R packages bigsnpr and bigutilsr can be installed from either CRAN or GitHub (see https://github.com/privefl/bigsnpr). A tutorial on the steps to perform PCA on 1000G data is available at https://privefl.github.io/bigsnpr/articles/bedpca.html. All code used for this paper is available at https://github.com/privefl/paper4-bedpca/tree/master/code.

Supplementary information: Supplementary data are available at Bioinformatics online.

Highlights

Principal Component Analysis (PCA) has been widely used in genetics for many years and in many contexts
To demonstrate that we provide very fast implementations of the different methods presented in this paper, we apply them to the UK Biobank
When applying our automatic procedure to remove long-range Linkage Disequilibrium (LD) regions, it does not converge after 5 iterations for the UK Biobank, meaning that it keeps detecting long-range LD regions at each iteration

Summary

Introduction

Principal Component Analysis (PCA) has been widely used in genetics for many years and in many contexts. Including PCs that capture LD as covariates in genetic analyses can lead to reduced power for detecting genetic associations within these LD regions. Another issue may arise when projecting individuals from a study dataset onto the PCs of a reference dataset: the projected PCs are shrunk towards 0 in the new dataset (Lee et al., 2010; Wang et al., 2015; Zhang et al., 2019). We derive implementations of truncated PCA and other useful functions, e.g. for performing LD thinning and computing various statistics. We make these available in a new release of the R package bigsnpr (v1.0.0); unlike the functions previously presented in Privé et al. (2018), these new functions can be used directly on PLINK bed/bim/fam files with some missing values. We explore options to detect outlier samples in PCA, whether the goal is to find a few outlier samples that may correspond to e.g. batch effects or family structure, or to restrict the data to individuals of homogeneous ancestry.
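The shrinkage of projected PCs towards 0 can be illustrated on simulated data. The following is a minimal sketch in plain Python, not the implementation provided in bigsnpr: it assumes a toy model with two equal-sized groups and Gaussian noise, and the helper names (`simulate`, `fit_pc1`, `project`) and the sizes `n`, `p` and `shift` are hypothetical choices made for illustration. It fits PC1 on a reference sample with far more variables than individuals, projects new individuals drawn from the same model, and compares the spread of the two sets of scores.

```python
import math
import random
import statistics


def simulate(n, p, shift, rng):
    """Simulate n samples x p features from two equal-sized groups.

    Group membership alternates by row; each feature mean is +shift or -shift.
    """
    return [
        [(shift if i % 2 == 0 else -shift) + rng.gauss(0.0, 1.0) for _ in range(p)]
        for i in range(n)
    ]


def fit_pc1(rows, n_iter=200):
    """Return (feature means, PC1 loadings, PC1 scores) for the reference rows.

    Uses power iteration on the n x n Gram matrix, which is cheap when n << p.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[x - m for x, m in zip(r, means)] for r in rows]
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered] for ri in centered]

    u = [math.sin(i + 1.0) for i in range(n)]  # arbitrary non-zero start vector
    for _ in range(n_iter):
        u = [sum(g * x for g, x in zip(row, u)) for row in gram]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]

    # The top eigenvalue of the Gram matrix is the squared top singular value.
    eigval = sum(ui * sum(g * x for g, x in zip(row, u)) for ui, row in zip(u, gram))
    sing = math.sqrt(eigval)
    loadings = [sum(u[i] * centered[i][j] for i in range(n)) / sing for j in range(p)]
    scores = [ui * sing for ui in u]
    return means, loadings, scores


def project(rows, means, loadings):
    """Project new rows onto PC1 using the reference means and loadings."""
    return [sum((x - m) * v for x, m, v in zip(r, means, loadings)) for r in rows]


rng = random.Random(1)
n, p, shift = 40, 400, 0.25  # toy sizes, assumed for illustration only
reference = simulate(n, p, shift, rng)
study = simulate(n, p, shift, rng)  # new individuals from the same two groups

means, loadings, ref_scores = fit_pc1(reference)
proj_scores = project(study, means, loadings)

ratio = statistics.pstdev(proj_scores) / statistics.pstdev(ref_scores)
print(f"SD of projected PC1 scores / SD of reference PC1 scores: {ratio:.2f}")
```

A ratio below 1 means the new individuals sit closer to 0 on PC1 than the reference individuals do, even though both were drawn from the same model. The reference scores are inflated because the PC was fitted to the noise of those very samples, which is why simple projection is biased when the number of variables greatly exceeds the number of individuals.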

Material and Methods

Efficient implementation of PCA for genotype data

Robust Mahalanobis distance

Detecting LD structure in PCA

Detecting outlier samples in PCA

Projecting PCs from a reference dataset

Application to the UK Biobank

Outlier sample detection

Projecting onto the PCA space from a reference dataset

Capturing subtle population structure in the UK Biobank

Discussion

Journal: Bioinformatics
Publication Date: May 16, 2020
Citations: 84
License type: CC BY 4.0
