Large-Scale Instance Selection Using a Heterogeneous Value Difference Matrix

Chatchai Kasemtaweechok, Chatchavin Hathorn, Nitiporn Sukkerd

doi:10.1007/978-981-15-4917-5_34

Abstract

Classifying a large-scale dataset is a common problem because a classifier model takes an overwhelming amount of time to learn from all of the data. Instance selection is a well-known technique that addresses this issue by reducing the size of the training set. Instance selection methods decrease the difficulty of data classification and improve the quality of the training data. This paper proposes a novel instance selection method that uses a heterogeneous value difference matrix (HVDM) distance function. The proposed method selects a set of median HVDM values in each partition as the reduced training set. We compared the proposed method with the condensed nearest neighbor (CNN) and instance-based learning (IB3) methods. Five large-scale datasets from the UCI data repository were tested with three classifier models (decision tree, neural network, and support vector machine). The accuracy and kappa of the proposed method were better than those of the other two methods, and the proposed method had a moderate reduction rate. Moreover, the accuracy and kappa of the proposed method were nearly equal to those obtained with the original training set.
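To make the distance function concrete, the following is a minimal sketch of an HVDM-style distance for mixed numeric and nominal attributes. It assumes the commonly cited Wilson and Martinez formulation (numeric difference scaled by four standard deviations, nominal difference from class-conditional probabilities), which the abstract does not spell out. The function names `fit_hvdm` and `hvdm`, the `stats` dictionary, and the handling of unseen nominal values are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def fit_hvdm(rows, labels, nominal):
    """Collect the per-attribute statistics an HVDM-style distance needs.

    rows: list of equal-length tuples; labels: class label per row;
    nominal: set of column indices holding nominal values (hypothetical API).
    """
    n_cols = len(rows[0])
    classes = sorted(set(labels))
    sigma = {}  # numeric column -> population standard deviation
    cond = {}   # nominal column -> value -> {class: P(class | value)}
    for a in range(n_cols):
        col = [r[a] for r in rows]
        if a in nominal:
            counts = defaultdict(Counter)
            for v, c in zip(col, labels):
                counts[v][c] += 1
            cond[a] = {
                v: {c: cnt[c] / sum(cnt.values()) for c in classes}
                for v, cnt in counts.items()
            }
        else:
            mean = sum(col) / len(col)
            sigma[a] = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
    return {"classes": classes, "sigma": sigma, "cond": cond, "n_cols": n_cols}


def hvdm(x, y, stats):
    """Distance between two instances: sqrt of summed squared attribute distances."""
    total = 0.0
    for a in range(stats["n_cols"]):
        if a in stats["cond"]:
            px = stats["cond"][a].get(x[a])
            py = stats["cond"][a].get(y[a])
            if px is None or py is None:
                d_sq = 1.0  # assumption: unseen nominal value counts as distance 1
            else:
                # Squared normalized value difference over all classes.
                d_sq = sum((px[c] - py[c]) ** 2 for c in stats["classes"])
        else:
            s = stats["sigma"][a]
            # Numeric attributes: |x - y| / (4 * sigma); constant columns contribute 0.
            d = abs(x[a] - y[a]) / (4 * s) if s > 0 else 0.0
            d_sq = d * d
        total += d_sq
    return math.sqrt(total)
```

A caller would fit the statistics once on the training set, for example `stats = fit_hvdm(rows, labels, nominal={1})`, and then evaluate `hvdm(rows[i], rows[j], stats)` for any pair of instances. How the distances are aggregated within each partition to find the median is not specified in the abstract, so that step is left out of this sketch.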

Full Text