Abstract

Instance selection aims at filtering out noisy data (or outliers) from a given training set, which not only reduces the need for storage space, but can also ensure that the classifier trained by the reduced set provides similar or better performance than the baseline classifier trained by the original set. However, since there are numerous instance selection algorithms, there is no concrete winner that is the best for various problem domain datasets. In other words, the instance selection performance is algorithm and dataset dependent. One main reason for this is because it is very hard to define what the outliers are over different datasets. It should be noted that, using a specific instance selection algorithm, over-selection may occur by filtering out too many ‘good’ data samples, which leads to the classifier providing worse performance than the baseline. In this paper, we introduce a dual classification (DuC) approach, which aims to deal with the potential drawback of over-selection. Specifically, performing instance selection over a given training set, two classifiers are trained using both a ‘good’ and ‘noisy’ sets respectively identified by the instance selection algorithm. Then, a test sample is used to compare the similarities between the data in the good and noisy sets. This comparison guides the input of the test sample to one of the two classifiers. The experiments are conducted using 50 small scale and 4 large scale datasets and the results demonstrate the superior performance of the proposed DuC approach over the baseline instance selection approach.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call