Abstract

Background: While inferring continental-level ancestry from genomic information is relatively simple, distinguishing between individuals from closely associated sub-populations (e.g., from the same continent) is still a difficult challenge.

Methods: We study the problem of predicting human biogeographical ancestry from genomic data under resource constraints. In particular, we focus on the case where the analysis is constrained to using single nucleotide polymorphisms (SNPs) from just one chromosome. We propose correlation-based and outlier-based methods to construct such ancestry-informative SNP panels.

Results: We assessed the performance of the proposed SNP panels, derived from just one chromosome, using data from the 1000 Genomes Project, Phase 3. For continental-level ancestry classification, we achieved an overall classification rate of 96.75% using 206 SNPs. For sub-population-level ancestry prediction, we achieved the following average pairwise binary classification rates: subpopulations in Europe: 76.6% (58 SNPs); Africa: 87.02% (87 SNPs); East Asia: 73.30% (68 SNPs); South Asia: 81.14% (75 SNPs); America: 85.85% (68 SNPs).

Conclusion: Our results demonstrate that a single chromosome (in particular, Chromosome 1), if carefully analyzed, could hold enough information for accurate prediction of human biogeographical ancestry. This has significant implications for the computational resources required for ancestry analysis, and for applications of such analyses, such as studies of genetic diseases, forensics, and soft biometrics.

Highlights

  • While inferring continental-level ancestry from genomic information is relatively simple, distinguishing between individuals from closely associated sub-populations is still a difficult challenge

  • We address the problems of both continental-level and sub-continental-level ancestry identification using small single nucleotide polymorphism (SNP) panels, with all SNPs in a panel coming from a single chromosome

  • In our specific problem of ancestry classification, SNPs that contain similar ancestry information are clustered together, while those that cannot be assigned to any cluster are identified as outliers carrying seemingly unique ancestry information. We consider these outlier SNPs good candidates for distinguishing biogeographical ancestry between populations
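
The outlier idea in the last highlight can be illustrated with a minimal sketch. It is not the authors' algorithm, which this summary does not specify. It assumes genotypes are coded as 0/1/2 minor-allele counts, uses absolute Pearson correlation as the similarity measure, and treats a SNP as an outlier when it is not strongly correlated with any other SNP. The function names, the threshold of 0.8, and the toy data are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant SNP carries no correlation signal
    return cov / sqrt(var_x * var_y)


def find_outlier_snps(genotypes, threshold=0.8):
    """Return SNP ids whose strongest |correlation| with any other SNP
    is below `threshold` (assumed cut-off, not taken from the paper).

    genotypes: dict mapping SNP id -> list of 0/1/2 codes per individual.
    """
    ids = list(genotypes)
    outliers = []
    for i in ids:
        strongest = max(
            (abs(pearson(genotypes[i], genotypes[j])) for j in ids if j != i),
            default=0.0,
        )
        if strongest < threshold:
            outliers.append(i)
    return outliers


# Hypothetical toy data: six individuals, four SNPs.
toy = {
    "snp_a": [0, 1, 2, 0, 1, 2],
    "snp_b": [0, 1, 2, 0, 1, 2],  # identical to snp_a, so they cluster
    "snp_c": [2, 1, 0, 2, 1, 0],  # perfectly anti-correlated with snp_a
    "snp_d": [0, 2, 0, 2, 1, 1],  # weakly correlated with all others
}

print(find_outlier_snps(toy))
```

In this toy example only `snp_d` falls below the threshold; the other three SNPs carry redundant information about each other. A real panel would need a further step, not shown here, to check that the selected outliers actually separate the populations of interest.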

Introduction

While inferring continental-level ancestry from genomic information is relatively simple, distinguishing between individuals from closely associated sub-populations (e.g., from the same continent) is still a difficult challenge. Accurate inference of biogeographical ancestry is important for various application areas. Identifying ancestry informative markers (AIMs) in the genome is essential for detecting population stratification in case-control association studies of complex diseases, such as cancer, diabetes, neurodegenerative diseases (e.g., Alzheimer’s disease), and cardiovascular diseases [1,2,3]. Reliable estimation of biogeographic ancestry is a key procedure in studies of admixed populations. Many studies are investigating the association between ancestry and certain types of diseases [11,12,13]. Analysis of genetic ancestry is a vast research area with numerous applications, and it has attracted the use of a diverse array of techniques.
