Comparative Analysis of Classification Models for Pima Dataset

Raghad Sehly, Mohammad Mezher

doi:10.1109/iccit-144147971.2020.9213821

Abstract

The amount of data being generated is increasing rapidly. For example, in 2019 the International Telecommunication Union (ITU) reported that the number of Internet users had reached about 4.1 billion (53.6% of the global population). This volume of data exceeds our ability to analyze it and extract useful information without the help of computational techniques. Data mining is a common technique used in Machine Learning (ML) to extract useful knowledge from big data, and classification algorithms are widely used to achieve accurate prediction. This study compares the accuracy of six classification techniques, with the Support Vector Machine (SVM) evaluated under three kernels, using the confusion matrix as the evaluation model. The techniques compared are K-Nearest Neighbor (K-NN), Radial Basis Function SVM (RBF SVM), Linear SVM, Sigmoid SVM, Logistic Regression (LR), Linear Discriminant Analysis (LDA), Classification and Regression Trees (CART), and Naive Bayes (NB). The UCI Pima Indians Diabetes Dataset is used, and the experiments are run on the Anaconda Python platform. The achieved accuracy is 0.7265 for K-NN, 0.612 for RBF SVM, 0.7721 for Linear SVM, 0.6510 for Sigmoid SVM, 0.7695 for LR, 0.7734 for LDA, 0.6952 for CART, and 0.7551 for NB.
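As a rough illustration of how an accuracy figure can be derived from a confusion matrix for a binary task such as diabetes prediction, the following minimal Python sketch may help. It is not the authors' code: the function names and the label vectors are hypothetical and chosen only for demonstration, and the numbers it produces are unrelated to the results reported above.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, positive=1):
    """Count (tp, fp, fn, tn) for binary labels; `positive` marks the positive class."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def accuracy(tp, fp, fn, tn):
    """Accuracy = correctly classified samples / all samples."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical ground-truth and predicted labels (1 = diabetic, 0 = non-diabetic).
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
print(f"TP={tp} FP={fp} FN={fn} TN={tn} accuracy={accuracy(tp, fp, fn, tn):.4f}")
```

In a comparison like the one described here, the same computation would be repeated for each classifier's predictions on the same data, so that the resulting accuracies are directly comparable.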

Full Text