Abstract
This paper addresses the prediction of diabetes using machine learning models, an active and important area of research. Six machine learning methods were evaluated on three diabetes datasets to compare model performance. The methods are Logistic Regression, k-Nearest Neighbour, Gaussian Naïve Bayes, Decision Tree, Random Forest, and XGBoost. The datasets are the Pima Indian dataset, the Frankfurt Hospital dataset, and a combined dataset; each has eight feature variables and one target variable. Because the train-test split ratio can make a significant difference in model performance, two split ratios, 50-50 and 90-10, were tested. Model performance was evaluated using four metrics: precision, recall, F1-score, and accuracy. Random Forest and XGBoost were found to be highly efficient and the best-performing of the six methods across all performance metrics, all datasets, and both train-test split ratios. They performed comparatively better on the combined dataset, which contains 2768 instances, indicating the importance of a large dataset for better results. In addition, the 90-10 split produced better results than the 50-50 split for all datasets and for almost all models.
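To make the evaluation setup concrete, the following is a minimal sketch of the two train-test split ratios and the four performance metrics, using only the Python standard library. It is not the authors' code: the function names `split_indices` and `binary_metrics`, the random seed, and the choice to round the test-set size up are all assumptions made for illustration.

```python
import math
import random


def split_indices(n, test_fraction, seed=0):
    """Shuffle the indices 0..n-1 and split them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Assumption: the test-set size is rounded up to a whole instance.
    n_test = math.ceil(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def binary_metrics(y_true, y_pred):
    """Return precision, recall, F1-score, and accuracy for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Split sizes for the combined dataset of 2768 instances.
for test_fraction in (0.5, 0.1):
    train, test = split_indices(2768, test_fraction)
    print(f"test fraction {test_fraction}: "
          f"{len(train)} train, {len(test)} test")
```

Under this rounding rule, the 90-10 split leaves 277 of the 2768 combined instances for testing, against 1384 for the 50-50 split, so the 90-10 models are trained on considerably more data.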