Identifying the biogeographic ancestral origin of biological sample left at a crime scene can provide important evidence for judicial case, as well as clue for narrowing down suspect. Ancestry informative single nucleotide polymorphism (AISNP) has become one of the most important genetic markers in recent years for screening ancestry information loci and analyzing the population genetic background and structure due to their high number and wide distributions in the human genome. In this study, based on data from 26 populations in the 1000 Genomes Project Phase 3, a Random Forest classification model was constructed with one-vs-rest classification strategy for embedded feature selection in order to obtain a panel with a small number of efficient AISNPs. The research aim was to clarify differentiations of population genetic structures among continents and subregions of East Asia. ADMIXTURE results showed that based on the 58 AISNPs selected by the machine learning algorithm, the 26 populations involved in the study could be categorized into six intercontinental ancestry components: North East Asia, South East Asia, Africa, Europe, South Asia, and America. The 24 continental-specific AISNPs and 34 East Asian-specific AISNPs were finally obtained, and used to construct the ancestry prediction model using XGBoost algorithm, resulting in the Matthews correlation coefficients of 0.94 and 0.89, and accuracies of 0.94 and 0.92, respectively. The machine learning models that we constructed using population-specific AISNPs were able to accurately predict the ancestral origins of continental and intra-East Asian populations. To summarize, screening a set of high-perform AISNPs to infer biogeographical ancestral information using embedded feature selection has potential application in creating a layered inference system that accurately differentiates from intercontinental populations to local subpopulations.
Read full abstract