Optimizing variable selection and neighbourhood size in the K-nearest neighbour algorithm

Ka Yuk Carrie Lin

doi:10.1016/j.cie.2024.110142

Abstract

The K-nearest neighbour (KNN) algorithm is a well-known classifier applied in various research areas. Its inputs are a set of variables, the neighbourhood size (K) and the distance metric, which are usually selected based on data characteristics. In previous studies, the first two are usually decided sequentially. This paper proposes a mixed integer linear program for simultaneous variable selection and determination of the neighbourhood size. The Euclidean distance is used, but the model constraints can be adapted for other distance metrics. The proposed model adopts accuracy and recall in turn as the objective function to determine the best combination of the two decisions in binary classification problems. Computational experiments are designed with ten publicly available datasets. Results show that using at least half of all variables with a smaller K value can already achieve better or equally good classification accuracy and recall rates. An effective set of variables and a small neighbourhood size in KNN can facilitate solving classification problems.
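
To make the idea of choosing the variables and K together concrete, the following minimal sketch enumerates every variable subset and every candidate K, and scores each pair on a toy binary dataset. It is not the mixed integer linear program proposed in the paper: exhaustive search stands in for the optimization model, and leave-one-out evaluation is an assumption made here for illustration. The function names (`knn_predict`, `loo_scores`, `joint_search`) and the toy data are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def knn_predict(train_X, train_y, query, k, cols):
    """Majority vote of the k nearest training points, using only the variables in cols."""
    def project(row):
        return [row[c] for c in cols]

    q = project(query)
    order = sorted(range(len(train_X)), key=lambda i: dist(project(train_X[i]), q))
    votes = sum(train_y[i] for i in order[:k])  # labels are 0/1
    return 1 if 2 * votes > k else 0


def loo_scores(X, y, k, cols):
    """Leave-one-out accuracy and recall (positive class = 1) for one (K, subset) pair."""
    correct = tp = fn = 0
    for i in range(len(X)):
        pred = knn_predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i], k, cols)
        correct += pred == y[i]
        if y[i] == 1:
            tp += pred == 1
            fn += pred == 0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return correct / len(X), recall


def joint_search(X, y, k_values, metric="accuracy"):
    """Return (score, variable subset, K) maximizing the chosen metric.

    Subsets are visited from smallest to largest and only a strictly better
    score replaces the incumbent, so ties favour fewer variables and the
    earlier K in k_values.
    """
    best = None
    n_vars = len(X[0])
    for size in range(1, n_vars + 1):
        for cols in combinations(range(n_vars), size):
            for k in k_values:
                acc, rec = loo_scores(X, y, k, cols)
                score = acc if metric == "accuracy" else rec
                if best is None or score > best[0]:
                    best = (score, cols, k)
    return best


# Toy data (hypothetical): variable 0 separates the classes, variable 1 is noise.
X = [[0.0, 5.0], [0.2, 1.0], [0.1, 9.0], [1.0, 4.8], [1.2, 1.1], [0.9, 9.2]]
y = [0, 0, 0, 1, 1, 1]

print(joint_search(X, y, k_values=[1, 3]))  # (1.0, (0,), 1)
```

On this toy data the search keeps only the informative variable with K = 1. Enumeration grows exponentially with the number of variables, which is one reason an optimization model is attractive for the joint decision.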

Full Text