Abstract

In classification analysis, the target (dependent) variable is often influenced not only by ratio-scale variables but also by qualitative (nominal-scale) variables. The majority of machine learning techniques accept only numerical inputs, so categorical variables must be converted into numerical values using encoding techniques. If a variable's values have no inherent relation or order, assigning numbers to them can mislead the machine learning technique. This paper presents a modified k-nearest-neighbors algorithm that calculates distances between the values of categorical (nominal) variables without encoding them. A student academic performance dataset is used to test the enhanced algorithm. The results show that the proposed algorithm outperforms the standard one, which requires nominal variables to be encoded before distances can be calculated: it performs 14% better in accuracy and is not sensitive to outliers.

Highlights

  • Data understanding is an important step for accurate analysis

  • Educational Data Mining (EDM) is an evolving discipline that deals with creating methods for exploring the specific and increasingly large-scale knowledge that comes from educational environments, and with using these methods to better understand students and the environments in which they learn [3, 4]

  • k-Nearest Neighbors (kNN) is one of the most popular classification algorithms due to its simplicity [5]. It stores all available cases and classifies a new sample by a majority vote of its neighbors: the sample is assigned to the class most common among its k nearest neighbors, as measured by a distance function
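The majority-vote rule described in the last highlight can be illustrated with a minimal sketch of standard kNN. It assumes purely numeric features and Euclidean distance; the function name `knn_predict` and the toy data are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest stored cases."""
    # Euclidean distance from the query to every stored case
    # (assumption: all features are numeric).
    distances = [
        (math.dist(x, query), label)
        for x, label in zip(train_X, train_y)
    ]
    # Keep the k closest cases, sorting on distance only.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # The most common label among the k neighbors wins the vote.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two numeric features per student, two classes.
train_X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
train_y = ["low", "low", "high", "high"]
prediction = knn_predict(train_X, train_y, (1.1, 0.9), k=3)
```

With k = 3, the query's neighbors are the two "low" cases and one "high" case, so the vote goes to "low". Note that nothing is fitted in advance: the stored cases are the model, and all the work happens at query time.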


Summary

INTRODUCTION

Data understanding is an important step for accurate analysis. Data pre-processing is the first step needed to aid algorithms and to improve efficiency before proceeding to the actual analysis. Data variables generally fall into one of four broad categories: nominal scale, ordinal scale, interval scale, and ratio scale [1]. For example, gender is a nominal variable that takes the values (male, female). The ratio scale possesses the qualities of the nominal, ordinal, and interval scales and has an absolute zero value; in addition, it permits comparisons between the values of different variables. Assigning numerical values to nominal attributes misleads machine learning algorithms by introducing differences or an order between values that did not originally exist in the attributes; this phenomenon is called subjectivity. This research proposes two similarity measures for the kNN algorithm that handle categorical variables without converting them to numerical values.
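The effect of assigning numbers to nominal values can be seen in a small sketch. It contrasts an arbitrary integer encoding with a simple overlap (match/mismatch) distance and combines the latter with a numeric feature. The overlap measure is a common textbook choice used here only as an illustration; it is not claimed to be either of the two similarity measures the paper proposes. The attribute values, the `codes` mapping, and the function names are all hypothetical.

```python
import math

# Hypothetical integer encoding of a nominal attribute (mode of transport).
# The numbers are arbitrary, yet they impose an order: bus < car < walk.
codes = {"bus": 0, "car": 1, "walk": 2}


def encoded_distance(a, b):
    """Distance between two nominal values after integer encoding."""
    return abs(codes[a] - codes[b])


def overlap_distance(a, b):
    """Overlap distance: 0 if the nominal values match, 1 otherwise."""
    return 0.0 if a == b else 1.0


def mixed_distance(x, y, nominal_idx):
    """Euclidean-style distance over mixed features.

    Positions listed in `nominal_idx` are compared with the overlap
    distance; all other positions are treated as numeric.
    """
    total = 0.0
    for i, (u, v) in enumerate(zip(x, y)):
        if i in nominal_idx:
            total += overlap_distance(u, v) ** 2
        else:
            total += (u - v) ** 2
    return math.sqrt(total)


# Encoding makes "bus" look twice as far from "walk" as from "car".
bus_walk_encoded = encoded_distance("bus", "walk")
bus_car_encoded = encoded_distance("bus", "car")

# Without encoding, every pair of different values is equally far apart.
bus_walk_mixed = mixed_distance(("bus", 3.0), ("walk", 3.0), {0})
bus_car_mixed = mixed_distance(("bus", 3.0), ("car", 3.0), {0})
```

Under the integer encoding, the distance between two categories depends on which codes happened to be assigned, which is the artificial order described above. Comparing the values directly removes that dependence, at the cost of treating all mismatches as equally dissimilar.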

RELATED WORKS
Distance functions
PROPOSED KNN ALGORITHM
DATA SET
Data mining
RESULT
Findings
CONCLUSION
