Abstract

Compression of training sets is a technique for reducing training set size without degrading classification accuracy. By reducing the size of a training set, training becomes more efficient and storage space is saved. In this paper, an incremental clustering algorithm, the Leader algorithm, is used to reduce the size of a training set by effectively subsampling it. Experiments on several standard data sets using SVM and KNN as classifiers indicate that the proposed method is more efficient than CONDENSE in reducing the size of the training set without degrading classification accuracy. While the compression ratio of the CONDENSE method is fixed, the proposed method offers a variable compression ratio through the cluster threshold value.

Keywords: Clustering, Support vector machine, KNN, Pattern recognition, CONDENSE.

1. Introduction

The training and/or testing complexity of a classifier usually depends on the size of the training set, e.g. the nearest neighbor (NN) classifier [1]. Nearest neighbor and its generalized form, the K-nearest neighbor (KNN) classifier, are among the most popular non-parametric classifiers. The membership of an unknown sample is decided by the majority vote of its K nearest neighbors. There is no explicit learning from the training set; the entire training set itself defines the decision boundaries. KNN is conceptually simple and shows good performance in many applications, e.g. it was used in face recognition for visitor identification [2], where it outperformed more sophisticated algorithms based on Principal Components Analysis (PCA) and neural networks. Unfortunately, when the training set is large, storing it requires a lot of memory, and searching for the nearest neighbors of a given test pattern to make a single membership classification takes longer. Obviously, reducing the size of a training set can improve the space and time efficiency of KNN. There has been considerable interest in reducing the training set size by editing, especially in the context of NN. Different proximity graphs (such as the Delaunay triangulation) may be used for editing NN rules [3,4], but the complexities of such approaches are prohibitively high. For example, the Voronoi diagram has a worst-case complexity of Θ(n^⌈d/2⌉) for n points in d dimensions.
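The following minimal Python sketch illustrates the two ideas this section relies on: a single-pass, Leader-style subsampling of the training set controlled by a distance threshold, followed by majority-vote KNN classification on the retained patterns. The function names, the Euclidean metric, and the choice to keep only cluster leaders as representatives are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: Leader-style single-pass subsampling of a training set,
# then majority-vote KNN on the retained leaders. Names, metric, and the
# "keep only leaders" policy are assumptions for illustration only.
from collections import Counter
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leader_subsample(X, y, threshold):
    """Single pass over (X, y): a pattern farther than `threshold` from every
    existing leader becomes a new leader; otherwise it is absorbed and dropped.
    A larger threshold keeps fewer leaders, i.e. a higher compression ratio."""
    leaders, labels = [], []
    for xi, yi in zip(X, y):
        if all(euclidean(xi, leader) > threshold for leader in leaders):
            leaders.append(xi)
            labels.append(yi)
    return leaders, labels


def knn_predict(x, X_train, y_train, k=3):
    """Classify x by majority vote over its k nearest training patterns."""
    ranked = sorted(zip(X_train, y_train), key=lambda p: euclidean(x, p[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy 2-D data: two well-separated classes.
    X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
    y = ["a", "a", "a", "b", "b", "b"]
    leaders, labels = leader_subsample(X, y, threshold=0.5)
    print(len(leaders), "leaders kept out of", len(X), "training patterns")
    print(knn_predict((0.15, 0.15), leaders, labels, k=1))
```

Under these assumptions, the threshold plays the role the abstract attributes to the cluster threshold value: tightening it retains more representatives, and relaxing it compresses the training set further.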
