Assigning individual animals to their respective breeds, populations or lineages is highly significant for evolutionary analyses of global cattle populations and for detecting the underlying genetic variation that may have facilitated the adaptation of these breeds to diverse environmental conditions. It is also important for discovering geographic patterns of genetic variation in cattle populations and for tracing the geographical origin of breeds, food products, and diseases. Given this, the present study was undertaken to elucidate the minimum number of informative single nucleotide polymorphism (SNP) markers, originally generated using the medium-density BovineSNP50 BeadChip across 1823 individuals representing 73 populations, needed to assign individual animals to the corresponding lineage/group (African, European, Indicine or admixed) and to respective populations within that lineage/group, using two well-known supervised machine learning (ML) algorithms, namely Random Forest (RF) and Extreme Gradient Boosting (XGBoost). Each of the two ML models was trained on panels of the most informative SNPs, selected using two statistical methods, i.e., principal component analysis (PCA) and Wright's fixation index (FST), and two ML methods (RF with Gini, and RF with MDA). For each of these four marker preselection approaches, three panels of the topmost discriminant SNPs (192, 96, and 48 SNPs) were created. These panels were evaluated based on their performance in assigning animals to the respective lineage, population group or population. The results showed that, for animal-to-lineage assignment, XGBoost achieved the best accuracy of 95% with the 192-SNP panel (selected via RF with MDA), followed by RF (93% accuracy) with the 192-SNP panel (selected via RF with either Gini or MDA).
Similarly, RF trained with the 48-SNP panel (selected via RF with Gini) achieved the best accuracy of 97% for assigning animals to the African lineage, while it achieved the best accuracy of 89% for assigning animals to admixed populations using the 96-SNP panel (selected via PCA). On the other hand, XGBoost achieved the best accuracy of 88% for assigning animals to European breeds using the 192-SNP panel (selected via the FST method). Furthermore, both RF and XGBoost performed poorly in assigning animals of the Indicine lineage to the correct group: the best accuracy for such assignment was 66%, achieved using RF with the 192-SNP panel (selected via the FST method). In conclusion, the study reports the applicability of statistical and ML approaches for identifying discriminatory SNPs useful for the assignment of individuals to corresponding lineages and to respective populations within lineages, and reveals the efficiency of XGBoost and RF-based ML models in performing such assignments. Both ML models performed better than the statistical ones in assigning animals to specific lineages, while they fared comparably to each other in assigning individuals to respective populations within lineages or population groups.
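To make the FST-based marker preselection concrete, the following is a minimal sketch of how SNPs might be ranked by a per-SNP fixation index and truncated to a fixed panel size. The abstract does not state which FST estimator was used, so the sample-size-weighted, heterozygosity-based form FST = (H_T - H_S) / H_T is an assumption here, as are the function names, the 0/1/2 genotype coding, and the toy data.

```python
from collections import defaultdict


def allele_freq(genotypes):
    """Frequency of the counted allele; each genotype is 0, 1 or 2 copies."""
    return sum(genotypes) / (2 * len(genotypes))


def fst_per_snp(genotypes, populations):
    """Heterozygosity-based FST = (H_T - H_S) / H_T for one SNP (assumed estimator)."""
    by_pop = defaultdict(list)
    for genotype, pop in zip(genotypes, populations):
        by_pop[pop].append(genotype)

    # Expected heterozygosity of the pooled sample.
    p_total = allele_freq(genotypes)
    h_total = 2 * p_total * (1 - p_total)
    if h_total == 0:
        return 0.0  # monomorphic SNP carries no information

    # Sample-size-weighted mean of within-population expected heterozygosity.
    n_total = len(genotypes)
    h_within = 0.0
    for pop_genotypes in by_pop.values():
        p = allele_freq(pop_genotypes)
        h_within += (len(pop_genotypes) / n_total) * 2 * p * (1 - p)

    return (h_total - h_within) / h_total


def top_fst_snps(genotype_matrix, populations, panel_size):
    """Rank SNPs by FST and keep the `panel_size` most differentiated ones."""
    scores = {
        snp: fst_per_snp(genotypes, populations)
        for snp, genotypes in genotype_matrix.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:panel_size], scores


# Hypothetical toy data: six animals from two populations, three SNPs.
demo_populations = ["A", "A", "A", "B", "B", "B"]
demo_genotypes = {
    "snp_fixed": [2, 2, 2, 0, 0, 0],    # alternate alleles fixed in A and B
    "snp_shared": [1, 1, 1, 1, 1, 1],   # identical frequency in both
    "snp_partial": [2, 2, 1, 1, 0, 0],  # partial differentiation
}
demo_panel, demo_scores = top_fst_snps(demo_genotypes, demo_populations, 2)
```

In practice the same ranking step would be run with `panel_size` set to 48, 96 or 192, and the resulting panel would then be passed to a classifier such as RF or XGBoost.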