Abstract

K-Nearest Neighbors (KNN) remains one of the most popular methods for supervised machine learning tasks. However, its performance often depends on the characteristics of the dataset and on appropriate feature scaling. In this paper, we explore the characteristics that make a dataset well suited to KNN. As part of this, two new measures of dataset dispersion, called mean neighborhood target standard deviation (MNTSD) and mean neighborhood target entropy (MNTE), are formulated to estimate the expected performance of KNN regressors and classifiers, respectively. We demonstrate empirically that these dispersion measures are indicative of the performance of KNN regression and classification. This idea is then used to learn feature weights that improve the accuracy of KNN classification and regression. We argue that MNTSD and MNTE, when used as objectives for learning feature weights, cannot be optimized with gradient-based methods, and we therefore develop optimization strategies based on metaheuristics, namely genetic algorithms and particle swarm optimization. The feature-weighting method is evaluated in both regression and classification settings on publicly available datasets, and its performance is compared to that of KNN without feature weighting. The results indicate that KNN with appropriate feature weighting outperforms the unweighted baseline.
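The abstract does not give formal definitions, but one plausible reading is sketched below: MNTSD as the average standard deviation of the regression targets within each point's k-neighborhood, and MNTE as the average Shannon entropy of the class labels within each neighborhood. The definitions, the choice of k, and the scikit-learn usage here are assumptions for illustration, not the authors' exact formulation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def mntsd(X, y, k=5):
    # Hypothetical reading: for each point, take the standard deviation
    # of the targets of its k nearest neighbors, then average over all
    # points. Lower values suggest neighborhoods with consistent targets,
    # i.e. a dataset friendlier to KNN regression.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    neigh_targets = y[idx[:, 1:]]  # drop each point itself (first neighbor)
    return float(neigh_targets.std(axis=1).mean())

def mnte(X, y, k=5):
    # Same idea for classification: Shannon entropy of the label
    # distribution inside each k-neighborhood, averaged over all points.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    entropies = []
    for row in idx[:, 1:]:
        _, counts = np.unique(y[row], return_counts=True)
        p = counts / counts.sum()
        entropies.append(-(p * np.log2(p)).sum())
    return float(np.mean(entropies))

Feature weighting would then scale each column before the neighbor search, e.g. mnte(X * w, y), and search for weights w that minimize the dispersion measure. Because the neighbor sets change discretely as w varies, such an objective is piecewise constant and provides no useful gradients, which is presumably why the paper turns to gradient-free metaheuristics (genetic algorithms and particle swarm optimization) rather than gradient-based optimization.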
