Selecting SNPs informative for African, American Indian and European Ancestry: application to the Family Investigation of Nephropathy and Diabetes (FIND).

Robert C Williams, Robert C Elston, Kent D Taylor, Rulan S Parekh, Xiuqing Guo, Eli Ipp, Robert G Nelson, William C Knowler, Vallabh O Shah, Barry I Freedman, Pankaj Kumar, Carl D Langefeld, Jeffrey R Schelling, Jasmin Divers, Jerome I Rotter, John R Sedor, Farook Thameem, Paul L Kimmel, Robert L Hanson, David J Leehey, Hanna E Abboud, Cheryl A Winkler, Susanne B Nicholas, Donald W Bowden, Michael J Klag, Sudha K Iyengar, Robert P Igo, Madeleine V Pahl, O Köhn, Phillip G Zager, Michael W Smith, Sharon G Adler, Denyse Thornley‐Brown

doi:10.1186/s12864-016-2654-x

Abstract

Background: The presence of population structure in a sample may confound the search for important genetic loci associated with disease. Our four samples in the Family Investigation of Nephropathy and Diabetes (FIND), namely European Americans, Mexican Americans, African Americans, and American Indians, are part of a genome-wide association study in which population structure might be particularly important. We therefore decided to study in detail one component of this structure, individual genetic ancestry (IGA). From SNPs present on the Affymetrix 6.0 Human SNP array, we identified three sets of ancestry informative markers (AIMs), each maximized for the information in one of the three contrasts among ancestral populations: Europeans (HAPMAP, CEU), Africans (HAPMAP, YRI and LWK), and Native Americans (full heritage Pima Indians). We estimate IGA and present an algorithm for the standard errors of the estimates, compare IGA to principal components, emphasize the importance of balancing information in the AIMs, and test the association of IGA with diabetic nephropathy in the combined sample.

Results: A fixed parental allele maximum likelihood algorithm was applied to the FIND to estimate IGA in four samples: 869 American Indians; 1385 African Americans; 1451 Mexican Americans; and 826 European Americans. When the information in the AIMs is unbalanced, the estimates are incorrect with large error. Individual genetic admixture is highly correlated with principal components for capturing population structure. It takes ~700 SNPs to reduce the average standard error of individual admixture below 0.01. When the samples are combined, the resulting population structure creates associations between IGA and diabetic nephropathy.

Conclusions: The identified set of AIMs, which includes American Indian parental allele frequencies, may be particularly useful for estimating genetic admixture in populations from the Americas. Failure to balance information in poly-ancestry maximum likelihood models creates biased estimates of individual admixture with large error. This also occurs when estimating IGA using the Bayesian clustering method as implemented in the program STRUCTURE. Odds ratios for the associations of IGA with disease are consistent with what is known about the incidence and prevalence of diabetic nephropathy in these populations.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2654-x) contains supplementary material, which is available to authorized users.
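
To make the idea of a maximum likelihood estimate with fixed parental allele frequencies concrete, the following is a minimal sketch rather than the authors' algorithm. It assumes biallelic SNPs, genotypes coded as the count (0, 1, or 2) of a reference allele, independent markers, and known reference-allele frequencies in each parental population. It uses a simple EM iteration, which is one common way to maximize this kind of likelihood; the function name `estimate_ancestry` and all parameter names are hypothetical, and standard errors are not computed.

```python
def estimate_ancestry(genotypes, parental_freqs, n_iter=200, tol=1e-8):
    """EM sketch for individual ancestry proportions (hypothetical helper).

    genotypes      : list of reference-allele counts (0, 1, or 2), one per SNP
    parental_freqs : list of tuples, one per SNP, giving the reference-allele
                     frequency in each parental population (held fixed)
    Returns a list of ancestry proportions that sums to 1.
    """
    k = len(parental_freqs[0])
    q = [1.0 / k] * k  # start from equal ancestry proportions
    for _ in range(n_iter):
        counts = [0.0] * k
        n_alleles = 0
        for g, freqs in zip(genotypes, parental_freqs):
            # Treat the reference copies and the alternate copies separately.
            for copies, is_ref in ((g, True), (2 - g, False)):
                if copies == 0:
                    continue
                # Probability that one allele copy came from each population.
                lik = [q[i] * (freqs[i] if is_ref else 1.0 - freqs[i])
                       for i in range(k)]
                total = sum(lik)
                if total == 0.0:
                    continue  # allele impossible under current q; skip it
                for i in range(k):
                    counts[i] += copies * lik[i] / total
                n_alleles += copies
        if n_alleles == 0:
            return q  # no usable data; return the starting values
        new_q = [c / n_alleles for c in counts]
        converged = max(abs(a - b) for a, b in zip(q, new_q)) < tol
        q = new_q
        if converged:
            break
    return q
```

With three parental populations, each tuple in `parental_freqs` would hold the European, American Indian, and African reference-allele frequencies, in whatever order the caller chooses, and the returned proportions follow the same order.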

Highlights

The presence of population structure in a sample may confound the search for important genetic loci associated with disease
The power of each single nucleotide polymorphism (SNP) to estimate individual genetic ancestry (IGA) is proportional to δ, the magnitude of the allele frequency difference between two parental populations; for each marker, δ was calculated for the three contrasts, |PEU-PAI|, |PEU-PAF|, and |PAI-PAF|, together with the information-for-assignment statistic (In) for each contrast (Table 2)
A set of ancestry informative markers, provided in the supplementary tables, is available for estimating American Indian ancestry; it reflects an ancestral tribe from the Paleo-Indian migration across the Bering Strait, the Pima Indians [33], who are the most completely characterized Indian group in North America
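
The two marker statistics named in the highlights can be illustrated with a short sketch. It assumes biallelic SNPs and a two-population contrast, and it computes In with the usual informativeness-for-assignment formula (natural logarithm, equal weight for each population). Whether the paper used exactly this form is an assumption, and the function names are hypothetical.

```python
import math


def _xlogx(p):
    """Return p * ln(p), with the usual convention that 0 * ln(0) = 0."""
    return p * math.log(p) if p > 0.0 else 0.0


def delta(p_a, p_b):
    """Absolute allele frequency difference between two populations."""
    return abs(p_a - p_b)


def informativeness(p_a, p_b):
    """Informativeness for assignment (In) of a biallelic SNP.

    p_a, p_b: frequency of the same allele in the two parental populations.
    In is 0 when the frequencies are equal and ln(2) when the allele is
    fixed in one population and absent in the other.
    """
    total = 0.0
    # Loop over the two alleles: the reference allele and its complement.
    for allele_freqs in ((p_a, p_b), (1.0 - p_a, 1.0 - p_b)):
        mean = sum(allele_freqs) / len(allele_freqs)
        total += -_xlogx(mean)
        total += sum(_xlogx(p) for p in allele_freqs) / len(allele_freqs)
    return total


def three_contrasts(p_eu, p_ai, p_af):
    """Return (delta, In) for the EU-AI, EU-AF, and AI-AF contrasts."""
    pairs = {"EU-AI": (p_eu, p_ai), "EU-AF": (p_eu, p_af), "AI-AF": (p_ai, p_af)}
    return {name: (delta(a, b), informativeness(a, b))
            for name, (a, b) in pairs.items()}
```

Calling `three_contrasts` with the European, American Indian, and African frequencies of one SNP returns both statistics for each of the three contrasts, which is the per-marker information a selection step would rank on.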

Summary

Introduction

The presence of population structure in a sample may confound the search for important genetic loci associated with disease. Our four samples in the Family Investigation of Nephropathy and Diabetes (FIND), namely European Americans, Mexican Americans, African Americans, and American Indians, are part of a genome-wide association study in which population structure might be important. From SNPs present on the Affymetrix 6.0 Human SNP array, we identified three sets of ancestry informative markers (AIMs), each maximized for the information in one of the three contrasts among ancestral populations: Europeans (HAPMAP, CEU), Africans (HAPMAP, YRI and LWK), and Native Americans (full heritage Pima Indians). Data from the Pima Indian GWAS, conducted with the Affymetrix Genome-Wide Human 6.0 SNP array [11], were used to isolate informative markers for IGA in American Indians, which were combined with 3 populations from HapMap to create a panel of AIMs.
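
One way to picture three marker sets, each chosen for a single contrast and kept equal in size, is sketched below. This is an illustrative assumption and not the selection procedure used in the paper: it ranks SNPs by δ within each contrast and fills the three sets in round-robin order so that no SNP is used twice. The function name, the contrast labels, and the SNP identifiers in the usage note are hypothetical.

```python
# Index pairs into a (p_eu, p_ai, p_af) frequency tuple, one per contrast.
CONTRASTS = {"EU-AI": (0, 1), "EU-AF": (0, 2), "AI-AF": (1, 2)}


def select_balanced_aims(freqs, n_per_contrast):
    """Pick up to n_per_contrast SNPs for each contrast, without reuse.

    freqs: dict mapping a SNP id to its (p_eu, p_ai, p_af) allele frequencies.
    Returns a dict mapping each contrast name to a list of SNP ids.
    """
    # Rank all SNPs by delta, largest first, separately for each contrast.
    ranked = {
        name: sorted(freqs,
                     key=lambda s, i=i, j=j: abs(freqs[s][i] - freqs[s][j]),
                     reverse=True)
        for name, (i, j) in CONTRASTS.items()
    }
    panel = {name: [] for name in CONTRASTS}
    used = set()
    for _ in range(n_per_contrast):
        for name in CONTRASTS:
            # Take the best-ranked SNP for this contrast that is still free.
            for snp in ranked[name]:
                if snp not in used:
                    panel[name].append(snp)
                    used.add(snp)
                    break
    return panel
```

For example, `select_balanced_aims({"snp1": (0.9, 0.1, 0.85), "snp2": (0.9, 0.8, 0.1)}, 1)` assigns `snp1` to the EU-AI set and `snp2` to the EU-AF set, leaving the AI-AF set empty because no unused SNP remains.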

Methods

Results

Discussion

Conclusion