Abstract
A single nucleotide polymorphism (SNP) is a variation at a single nucleotide position in the genome that occurs within a population. Many statistical methods have been proposed to predict the racial classification of individuals from SNP genetic data. Choosing the right classification method matters because it determines the accuracy of the classification results. This research aims to identify which of two popular machine learning (ML) classification methods, K-Nearest Neighbors (KNN) and Support Vector Machine (SVM), achieves the higher average accuracy. The study used SNP genetic data for 120 samples from two races, CEU (European) and Yoruba (African), with 10 SNPs selected at the same locations for each sample. Each classification method was tested with 10%, 20%, 30%, 40%, and 50% of the data used as test data; KNN was combined with Euclidean distance. The results show that KNN has a better average prediction accuracy than SVM when the SNP locations used in the tests are highly correlated with the sample class. In this case, the highest average accuracy is 98.906% for KNN and 98.779% for SVM. According to the Wilcoxon statistical test at a significance level of α = 0.05, the difference between the highest average accuracies of KNN and SVM is significant. The benefit of this research is to help identify a suitable classification method for predicting individual racial classification from SNP genetic data.
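The KNN part of the protocol described above can be illustrated with a minimal sketch. It is not the authors' implementation: the genotype coding (0/1/2 allele counts), the synthetic data generator, the allele frequencies, the class label strings, and the choice k = 3 are all assumptions made for illustration, and the SVM comparison and the Wilcoxon test are omitted. Only the sample size (120), the number of SNPs (10), the Euclidean distance, and the test percentages come from the abstract.

```python
import math
import random
from collections import Counter


def make_synthetic_genotypes(n_per_class=60, n_snps=10, seed=0):
    """Build hypothetical data: one 0/1/2 allele count per SNP and a class label."""
    rng = random.Random(seed)
    # Assumed per-class allele frequencies; NOT taken from the study.
    freqs = {"CEU": 0.1, "YRI": 0.9}
    samples = []
    for label, p in freqs.items():
        for _ in range(n_per_class):
            # Each genotype is the number of copies (0, 1, or 2) of one allele.
            genotype = [sum(rng.random() < p for _ in range(2)) for _ in range(n_snps)]
            samples.append((genotype, label))
    return samples


def euclidean(a, b):
    """Euclidean distance between two equal-length genotype vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    nearest = sorted(train, key=lambda s: euclidean(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def knn_accuracy(samples, test_fraction, k=3, seed=0):
    """Shuffle, hold out test_fraction of the samples, and score KNN on them."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    correct = sum(knn_predict(train, g, k) == label for g, label in test)
    return correct / n_test


if __name__ == "__main__":
    data = make_synthetic_genotypes()
    for pct in (10, 20, 30, 40, 50):
        print(f"test data {pct}%: accuracy {knn_accuracy(data, pct / 100):.3f}")
```

Averaging `knn_accuracy` over many seeds for each test percentage would give the kind of average accuracy the abstract reports; comparing those averages against an SVM would require an additional classifier and a paired test, which this sketch does not attempt.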