Comparison of Data Mining Classification Algorithms on Educational Data under Different Conditions

İlhan Koyuncu, Selahattin Gelbal

doi:10.21031/epod.696664

Abstract

The purpose of this study was to examine the performance of Naive Bayes, k-nearest neighbors, neural networks, and logistic regression analysis in terms of sample size and test data ratio in classifying students according to their mathematics performance. The target population was 62,728 students in the 15-year-old group from Organisation for Economic Co-operation and Development (OECD) countries who participated in the Programme for International Student Assessment (PISA) in 2012. The performance of each algorithm was tested by using 11%, 22%, 33%, 44%, and 55% of each dataset for small (500 students), medium (1000 students), and large (5000 students) sample sizes. Each analysis was replicated 100 times. Accuracy rates, RMSE values, and total elapsed time were used as the evaluation criteria. The RMSE values of the algorithms were compared statistically by using Friedman and Wilcoxon tests. The results revealed that while the classification performance of the methods increased as the sample size increased, increasing the training data ratio affected the performance of the algorithms in different ways. Naive Bayes showed high performance even in small samples, performed the analyses very quickly, and was not affected by changes in the training data ratio. Logistic regression analysis was the most effective method in large samples but performed poorly in small samples. While neural networks showed a similar tendency, their overall performance was lower than that of Naive Bayes and logistic regression. The lowest performance in all conditions was obtained by the k-nearest neighbors algorithm.
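The evaluation design described above (sample size crossed with test data ratio, repeated over replications, scored by accuracy, RMSE, and elapsed time) can be sketched in a few lines. This is a minimal illustration only, not the authors' procedure: the data are synthetic, the helper names (`make_students`, `fit_gaussian_nb`, `predict_proba`, `evaluate`) are hypothetical, a hand-written Gaussian Naive Bayes stands in for the four algorithms, and RMSE is assumed to be computed between the predicted class-1 probability and the 0/1 label, which is one common convention that the abstract does not confirm.

```python
import math
import random
import time


def make_students(n, rng):
    """Synthetic stand-in for student records: two numeric features, binary label."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label else -1.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows


def fit_gaussian_nb(train):
    """Estimate the prior, feature means, and feature variances for each class."""
    model = {}
    for cls in (0, 1):
        xs = [x for x, y in train if y == cls]
        n = len(xs)
        cols = list(zip(*xs))
        means = [sum(col) / n for col in cols]
        # Small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(cols, means)]
        model[cls] = (n / len(train), means, variances)
    return model


def predict_proba(model, x):
    """Posterior probability of class 1 under the fitted model."""
    log_post = {}
    for cls, (prior, means, variances) in model.items():
        ll = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            ll += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        log_post[cls] = ll
    top = max(log_post.values())  # subtract the max for numerical stability
    weights = {c: math.exp(lp - top) for c, lp in log_post.items()}
    return weights[1] / (weights[0] + weights[1])


def evaluate(population, sample_size, test_ratio, replications, seed=0):
    """Return mean accuracy, mean RMSE, and total elapsed seconds for one condition."""
    rng = random.Random(seed)
    accs, rmses = [], []
    start = time.perf_counter()
    for _ in range(replications):
        # rng.sample returns records in random order, so slicing is a random split.
        sample = rng.sample(population, sample_size)
        n_test = round(sample_size * test_ratio)
        test, train = sample[:n_test], sample[n_test:]
        model = fit_gaussian_nb(train)
        probs = [predict_proba(model, x) for x, _ in test]
        hits = sum((p >= 0.5) == bool(y) for p, (_, y) in zip(probs, test))
        sq_err = sum((p - y) ** 2 for p, (_, y) in zip(probs, test))
        accs.append(hits / n_test)
        rmses.append(math.sqrt(sq_err / n_test))
    elapsed = time.perf_counter() - start
    return sum(accs) / replications, sum(rmses) / replications, elapsed


if __name__ == "__main__":
    population = make_students(10000, random.Random(42))
    for size in (500, 1000, 5000):
        for ratio in (0.11, 0.22, 0.33, 0.44, 0.55):
            # 10 replications keep the demo fast; the study reports 100.
            acc, rmse, secs = evaluate(population, size, ratio, replications=10)
            print(f"n={size} test={ratio:.2f} acc={acc:.3f} rmse={rmse:.3f} time={secs:.2f}s")
```

Keeping the per-replication RMSE values instead of only their mean would give the paired measurements that a Friedman or Wilcoxon comparison across algorithms requires.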

Highlights

Data mining is used to discover hidden patterns and relationships that help decision making by processing large amounts of data (Bhardwaj & Pal, 2011).
The aim of this study is to examine the performance of Naive Bayes, k-nearest neighbors, neural networks, and logistic regression analysis in terms of sample size and training data ratio in classifying students according to their Programme for International Student Assessment (PISA) mathematics performance.
While the Naive Bayes (NB) method showed the highest performance in the sample of 500 students, the logistic regression (LR) method showed the highest performance in the samples of 1000 and 5000 students.

Summary

Introduction

Data mining is used to discover hidden patterns and relationships that help decision making by processing large amounts of data (Bhardwaj & Pal, 2011). In many disciplines, a wide variety of methods based on mathematical and statistical algorithms are used to predict, to cluster, and to reveal relationship networks. Data mining methods, which are applied in fields ranging from marketing and engineering to health sciences and business, are now also used to examine the large and complex educational datasets that have been growing rapidly with technological developments. Predicting student success is the focus of much educational research. Today, as technology develops rapidly and gains importance in education, databases are available that contain many of the factors affecting student success. In addition to course management systems such as Blackboard and Moodle, which are rich sources of educational data, data are collected at the student, teacher, school, regional, and country levels in large-scale assessments such as Trends In International

Objectives

Methods

Results

Conclusion