Abstract

Next-generation sequencing technologies now make it possible to sequence and genotype hundreds of thousands of genetic markers across the human genome. Selecting informative markers from high-dimensional genomic datasets to characterize an individual's genomic makeup has become common practice in evolutionary biology and human genetics. Although several feature selection approaches exist for determining ancestry proportions in two-way admixed populations, including African Americans, few statistical tools have been developed for feature selection in three-way admixed populations (including Latino populations). Herein, we present a new likelihood-based feature selection method called Lancaster Estimator of Independence (LEI), which uses allele frequency information to prioritize the features most informative for determining ancestry proportions from multiple ancestral populations in admixed individuals. LEI leverages summary-level statistics from allele frequency data, thereby avoiding the many restrictions (and big data issues) that can accompany access to individual-level genotype data; this makes it appealing for reducing the computational cost and time of ancestry inference in an admixed population. We compared our allele frequency-based approach with genotype-based approaches in estimating admixture proportions in three-way admixed population scenarios. Our results showed that ancestry estimates using the top-ranked features from LEI were comparable with estimates using features from genotype-based methods in three-way admixed populations. We provide easy-to-use R code to assist researchers in using the LEI tool to develop allele frequency-based informative features for admixture mapping studies of mixed samples with multiple ancestral origins.

Highlights

  • The task of identifying and assigning an individual’s genomic regions to the correct ancestries can be difficult in admixed populations with multiple ancestral origins

  • We compared the performance of Lancaster Estimator of Independence (LEI) in selecting informative markers with existing machine learning approaches including Principal Components Analysis (PCA), Random Forest (RF), and Support Vector Machine (SVM) using real and simulated data in a logistic regression framework

  • The number of informative markers required to build an efficient classifier was found to be similar among these methods with the exception of PCA, which requires a larger number of markers to attain a comparable accuracy (Fig. 4)

Summary

Introduction

The task of identifying and assigning an individual’s genomic regions to the correct ancestries can be difficult in admixed populations with multiple ancestral origins. As a feature selection approach, these methods can be used to rank the informative SNPs, which serve as a classifier to discriminate populations based on continental ancestry. All these methods require individual-level genotype data, and selecting subsets of informative SNPs from ever-growing genomic datasets with millions of SNPs can be very expensive [13,15]. It is desirable to find an efficient feature selection method that identifies the ancestry informative markers (AIMs) needed to estimate admixture proportions in samples from admixed populations with high accuracy, while remaining computationally feasible, cost-effective, and applicable to multi-way admixture and summary-level data. We compare the performance of LEI-based approaches with standard machine learning approaches applicable to feature selection in three-way admixed populations, including Principal Components Analysis (PCA), Random Forest (RF), and Support Vector Machine (SVM). As summary-level data become more accessible, LEI provides a methodological advance in feature selection with wide applicability to multi-way admixture analysis, which will become more common due to global admixture.
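To make the idea of ranking markers from summary-level data concrete, the following is a minimal Python sketch that scores biallelic SNPs using only per-population allele frequencies. It does not implement the LEI statistic, whose formula is not given in this excerpt. As a stand-in, it uses the mutual information between population label and allele under equal population weights. The function names, SNP identifiers, and frequency values are all hypothetical and chosen purely for illustration.

```python
import math


def allele_population_information(alt_freqs):
    """Stand-in informativeness score (NOT the LEI statistic).

    Returns the mutual information, in nats, between population label and
    allele at one biallelic SNP, assuming equally weighted populations.
    alt_freqs holds one alternate-allele frequency per ancestral population.
    """
    k = len(alt_freqs)
    mean_alt = sum(alt_freqs) / k
    ref_freqs = [1.0 - f for f in alt_freqs]
    score = 0.0
    # Sum the contribution of each allele (alternate, then reference).
    for freqs, mean in ((alt_freqs, mean_alt), (ref_freqs, 1.0 - mean_alt)):
        if mean > 0.0:
            score -= mean * math.log(mean)
        score += sum(p * math.log(p) for p in freqs if p > 0.0) / k
    return score


def rank_markers(freq_table, top_n):
    """Return the top_n SNP identifiers, ordered by descending score."""
    ranked = sorted(
        freq_table,
        key=lambda snp: allele_population_information(freq_table[snp]),
        reverse=True,
    )
    return ranked[:top_n]


# Made-up alternate-allele frequencies in three hypothetical ancestral populations.
example_freqs = {
    "snp_a": (0.90, 0.10, 0.10),  # separates the first population from the other two
    "snp_b": (0.50, 0.50, 0.50),  # identical everywhere, so uninformative
    "snp_c": (0.02, 0.50, 0.98),  # differs across all three populations
}

top_markers = rank_markers(example_freqs, top_n=2)
```

The score is zero when a SNP has the same frequency in every population and is bounded above by the logarithm of the number of populations, so it extends naturally from two-way to three-way admixture. Because it needs only a frequency table, no individual-level genotypes are accessed at any point.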
