Assessing the factors influencing the performance of machine learning for classifying haplogroups from Y-STR haplotypes

Guang-Yao Fan

doi:10.1016/j.forsciint.2022.111466

Abstract

Two distinct types of genetic marker, single nucleotide polymorphisms (Y-SNPs) and short tandem repeats (Y-STRs), coexist in the non-recombining portion of the Y chromosome. Because of their different mutation rates, Y-STRs and Y-SNPs play distinct roles in forensic and evolutionary genetics. Current approaches to inferring haplogroup status rely on genotyping many Y-SNP loci. Given the relationship between the haplotype and the haplogroup of a Y chromosome, Y-STR typing offers a cost-effective alternative for haplogroup prediction. Many machine learning algorithms have been proposed for assigning a Y-STR haplotype to a haplogroup. However, a series of issues must be resolved before machine learning methods can be used in practice. In this study, we therefore built k-nearest neighbor (kNN) classifiers under several different scenarios and assessed the factors that may influence the performance of the kNN prediction model for classifying haplogroups. The training set was based on a diverse ground-truth data set comprising Y-STR haplotypes and their corresponding Y-SNP haplogroups. Our results showed that combining haplogroups of different levels among the observations, or predicting across racial groups (transracial prediction), was impractical. Moreover, including more slowly mutating Y-STR loci improved classification accuracy. These results reveal the preconditions for effective and accurate haplogroup assignment by the kNN classifier.
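As a rough illustration of the idea of assigning a Y-STR haplotype to a haplogroup with kNN, the following minimal sketch may help. It is not the model built in the study: the repeat counts, the haplogroup labels attached to them, the choice of four loci, the Manhattan distance, and k = 3 are all assumptions made purely for illustration.

```python
from collections import Counter

# Hypothetical toy data: repeat counts at four Y-STR loci.
# The values and their haplogroup labels are invented for illustration
# and are not taken from the study's data set.
LOCI = ("DYS19", "DYS389I", "DYS390", "DYS391")
TRAINING = [
    ((14, 13, 24, 11), "R"),
    ((14, 13, 25, 11), "R"),
    ((14, 13, 24, 10), "R"),
    ((15, 12, 23, 10), "O"),
    ((15, 12, 24, 10), "O"),
    ((16, 12, 23, 10), "O"),
]


def manhattan(a, b):
    """Sum of absolute repeat-count differences across loci (assumed metric)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def predict_haplogroup(haplotype, training=TRAINING, k=3):
    """Return the majority haplogroup among the k nearest training haplotypes."""
    neighbours = sorted(training, key=lambda item: manhattan(haplotype, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    query = (14, 13, 24, 11)
    print(dict(zip(LOCI, query)), "->", predict_haplogroup(query))
```

In this sketch every training label sits at the same haplogroup level; mixing levels in one training set is the kind of situation the abstract reports as impractical.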

Full Text