Abstract

Nowadays, the massive production of raw multivariate data implies an accumulation of redundant information. When machine learning approaches learn from unfiltered data, the training time may increase significantly. This problem can be overcome with instance selection (IS), which obtains a subset of patterns that preserves the original underlying distribution. IS has often been addressed as an optimization problem solved by evolutionary algorithms, which usually adopt a binary representation that encodes every instance explicitly; hence, the larger the number of instances, the greater the search space. We propose an IS method based on linkage trees, in which one cut-off level per class is encoded to reduce the search space. Thus, the codification length equals the number of classes, which is considerably smaller than the binary codification size. Clusters of instances are created from the cut-off levels, and their medoids form the selected subset of instances. The cut-off levels of the linkage trees are optimized by simulated annealing to find the subset of instances that best preserves the original data distribution with a reduced number of instances. The experimental results on synthetic and real-world datasets show that the proposed approach outperforms IS methods from the literature regarding the trade-off between classification accuracy, reduction rate, and density preservation.
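The core idea of the abstract can be illustrated with a short sketch: for each class, build a linkage tree, cut it at that class's cut-off level, and keep one medoid per resulting cluster. The function name `select_instances`, the choice of average linkage, and the `cutoffs` mapping are illustrative assumptions, not the paper's exact implementation (which additionally optimizes the cut-off levels with simulated annealing).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def select_instances(X, y, cutoffs):
    """Illustrative sketch: per-class linkage-tree cut plus medoid selection.

    X       : (n, d) feature matrix
    y       : (n,) class labels
    cutoffs : dict mapping class label -> distance cut-off level
              (in the paper these levels are found by simulated annealing;
              here they are supplied by the caller)
    Returns the indices of the selected instances.
    """
    selected = []
    for label, t in cutoffs.items():
        idx = np.where(y == label)[0]
        Xc = X[idx]
        # Linkage tree for this class (average linkage is an assumption).
        Z = linkage(Xc, method="average")
        # Cutting the tree at level t yields flat clusters.
        clusters = fcluster(Z, t=t, criterion="distance")
        D = squareform(pdist(Xc))  # pairwise distances within the class
        for c in np.unique(clusters):
            members = np.where(clusters == c)[0]
            # Medoid: the member minimizing total distance to the others.
            medoid = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
            selected.append(idx[medoid])
    return np.array(sorted(selected))
```

Because the search variable is one cut-off level per class, an optimizer such as simulated annealing only explores a space whose dimension equals the number of classes, instead of a binary mask over all n instances.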
