Parameter-Free Conglomerate Nearest Neighbor Classifier Using Mass-Ratio-Variance Outlier Factors

Patcharasiri Fuangfoo, Krung Sinapiromsaran, Chulalongkorn University, Bangkok, Thailand

doi:10.18178/ijml.2023.13.4.1145


Open Access



Abstract

Classification is an important area of machine learning in which a classifier built from historical data with known classes assigns a class label to an instance. One popular classifier is k-NN, the k-nearest neighbor algorithm, which requires a global parameter k. A single global value may not suit every instance. Instances naturally lie in different regions relative to the clusters: an interior instance lies inside a cluster, a border instance lies on its outskirts, and an outer instance lies far away from any cluster, and each kind calls for a different number of neighbors. To assign a different number of neighbors to each instance automatically, the proposed method borrows the concept of scoring from anomaly detection research. The Mass-ratio-variance Outlier Factor (MOF) is selected as the scoring scheme that determines the number of neighbors of each instance. MOF gives the highest score to an instance placed very far from any cluster and the lowest score to an instance surrounded by other instances. This leads to the proposed classifier, called the conglomerate nearest neighbor classifier, which requires no parameter and assigns an appropriate number of neighbors to each instance according to its MOF ordering. Experimental results show that this classifier achieves accuracy similar to that of the k-nearest neighbor algorithm with the best k on the synthesized datasets. Six UCI datasets are also used in the experiments: QSAR, German, Cancer, Wholesale, Haberman, and Glass3. The proposed method outperforms k-NN on two of them, Wholesale and Glass3, and displays similar performance over the six UCI datasets as a whole.
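The idea of choosing the number of neighbors per instance from an outlier score can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not define MOF or the rule that maps scores to neighbor counts, so the sketch substitutes a simple stand-in score (mean distance to the training points) and assumes that more isolated instances receive more neighbors, capped at a hypothetical `k_max` defaulting to the square root of the training set size. The function names `isolation_score`, `neighbors_from_score`, and `predict` are likewise hypothetical.

```python
import numpy as np

def isolation_score(X, q):
    """Stand-in outlier score (NOT MOF): mean Euclidean distance from q to the rows of X."""
    return float(np.linalg.norm(X - q, axis=1).mean())

def neighbors_from_score(score, train_scores, k_max):
    """Assumed mapping: rank the score among the training scores and scale it to 1..k_max."""
    frac = float(np.mean(train_scores <= score))  # fraction of training scores not above it
    return max(1, int(round(frac * k_max)))

def predict(X, y, q, k_max=None):
    """Classify q by majority vote among a per-instance number of nearest neighbors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    q = np.asarray(q, dtype=float)
    n = len(X)
    if k_max is None:
        k_max = max(1, int(np.sqrt(n)))  # assumed cap, not taken from the paper
    # Leave-one-out scores of the training points serve as the reference ranking.
    train_scores = np.array(
        [isolation_score(np.delete(X, i, axis=0), X[i]) for i in range(n)]
    )
    k = neighbors_from_score(isolation_score(X, q), train_scores, k_max)
    # Ordinary nearest-neighbor vote, but with the instance-specific k.
    nearest = np.argsort(np.linalg.norm(X - q, axis=1))[:k]
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)], k
```

Calling `predict(X, y, q)` returns the predicted label together with the number of neighbors used for that query, which makes it easy to check how the neighbor count varies between interior and outlying instances. Replacing `isolation_score` with an actual MOF implementation would bring the sketch closer to the method described above.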
