Abstract

Determining an individual's biogeographic ancestry from DNA remains a major task in forensic laboratories worldwide. Because full-scale genomic data acquisition and processing are costly, many forensic laboratories cannot conduct comprehensive genetic testing based on ancestry-informative single nucleotide polymorphisms (aiSNPs), creating a need for more cost-effective sources of information. In the present study, we assessed machine learning (ML) approaches to the analysis of short tandem repeats (STRs), non-coding repeats of a short DNA sequence, for determining biogeographic ancestry. STRs are repeat sequences in which a unit of 1 to 25 nucleotides is repeated at various locations across the genome. Because STRs are highly variable, they are widely used to create unique genetic profiles of individuals. We analyzed the performance of selected loci in random forest classification models using anonymized STR data, provided by the US Department of Defense (DoD) and collected from N = 1747 subjects across K = 5 continents, to predict the continental origin of each individual given their genome. Supervised classification test accuracy varied from ~45% to >60%, while 10-fold training accuracy varied from 60% to ~80% across the profiles surveyed. Unsupervised clustering test accuracy was ~35%. Our findings indicate considerable potential for STR data as a novel basis for continental ancestry prediction, and with further research, high accuracy may be reached. We conclude with comments on future strategies for parameter optimization to maximize the utility of STR analysis, which may benefit smaller laboratories and expedite biogeographic ancestry determination for forensic professionals and law enforcement officials.
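
To make the notion of an STR allele concrete, the following is a minimal Python sketch, not the study's pipeline. It counts the longest run of back-to-back copies of a repeat unit in a toy sequence, then flattens a two-allele profile into a numeric feature vector of the kind a classifier such as a random forest could consume. The function names, the locus labels (`LOCUS_A`, `LOCUS_B`), the `GATA` motif, and the sorted two-values-per-locus encoding are all illustrative assumptions; the abstract does not describe how the profiles were actually encoded.

```python
from typing import Dict, List, Sequence, Tuple


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must be non-empty")
    step = len(motif)
    best = 0
    # Try every start position; a longer run may begin inside a shorter one.
    for start in range(len(sequence) - step + 1):
        run = 0
        pos = start
        while sequence.startswith(motif, pos):
            run += 1
            pos += step
        best = max(best, run)
    return best


def encode_profile(
    profile: Dict[str, Tuple[float, float]], loci: Sequence[str]
) -> List[float]:
    """Flatten a two-allele STR profile into a fixed-order feature vector.

    Assumed encoding (hypothetical): two repeat counts per locus, sorted so
    that the order in which the alleles were recorded does not matter.
    """
    features: List[float] = []
    for locus in loci:
        low, high = sorted(profile[locus])
        features.extend([float(low), float(high)])
    return features


if __name__ == "__main__":
    # Toy read: seven copies of the unit "GATA" between two flanking regions.
    toy_read = "CCTA" + "GATA" * 7 + "GGTC"
    print(longest_tandem_run(toy_read, "GATA"))

    # Hypothetical profile: locus label -> the two observed repeat counts.
    toy_profile = {"LOCUS_A": (9, 7), "LOCUS_B": (12, 12)}
    print(encode_profile(toy_profile, ["LOCUS_A", "LOCUS_B"]))
```

Sorting the two alleles at each locus is one simple way to make the encoding independent of recording order. Other encodings, such as one-hot allele indicators, are equally plausible.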

