Abstract

Combining samples across multiple cohorts in large-scale scientific research programs is often required to achieve the necessary power for genome-wide association studies. Controlling for genomic ancestry through principal component analysis (PCA) to address the effect of population stratification is a common practice. In addition to local genomic variation, such as copy number variation and inversions, other factors directly related to combining multiple studies, such as platform and site recruitment bias, can drive the correlation patterns in PCA. In this report, we describe the combination and analysis of multi-ethnic cohort with biobanks linked to electronic health records for large-scale genomic association discovery analyses. First, we outline the observed site and platform bias, in addition to ancestry differences. Second, we outline a general protocol for selecting variants for input into the subject variance-covariance matrix, the conventional PCA approach. Finally, we introduce an alternative approach to PCA by deriving components from subject loadings calculated from a reference sample. This alternative approach of generating principal components controlled for site and platform bias, in addition to ancestry differences, has the advantage of fewer covariates and degrees of freedom.

Highlights

  • To reach the statistical power needed for genome-wide association studies, large numbers of participants are needed

  • This can be achieved through large research networks such as the Electronic Medical Records and Genomics Network, which comprises a multi-ethnic cohort of ∼57,000 participants linked to electronic health records (EHRs) for phenotype mining from nine participating sites in the United States (U.S.) (Gottesman et al, 2013)

  • To assess ancestry in related individuals, Zhu et al (2008) introduced a method of generating principal components (PCs) by deriving SNP loadings from founders, and applying them to the entire sample. We introduce this concept of deriving SNP loadings from the BEAGLE imputation 1000 Genomes reference sample, and apply it to the entire imputed sample set of 57,000 genotyped individuals from the eMERGE Network as an alternative approach to control for site and platform bias in addition to ancestry differences for our large cohort

Read more

Summary

Introduction

To reach the statistical power needed for genome-wide association studies, large numbers of participants are needed. This can be achieved through large research networks such as the Electronic Medical Records and Genomics (eMERGE) Network, which comprises a multi-ethnic cohort of ∼57,000 participants linked to electronic health records (EHRs) for phenotype mining from nine participating sites (seven adult; two pediatric) in the United States (U.S.) (Gottesman et al, 2013). When combining genetic data from diverse data sets, understanding the contribution of ancestry, genotyping platform, and site bias are of vital importance. Imputation using the BEAGLE software was carried out to allow merging of the diverse data sets

Methods
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.