Abstract

Given the growing amount of data produced across different areas of knowledge, data mining methods now face challenging datasets with ever larger numbers of instances and attributes. However, the processing capacity of data mining algorithms is struggling to keep pace with this growth. One way to tackle the problem is to perform instance selection, reducing the size of the data as a preprocessing step for data mining algorithms.

This study presents e-MGD, an instance selection method that extends Markov Geometric Diffusion, a linear-complexity method used in computer graphics for the simplification of triangular meshes. The original method was extended so that it could reduce datasets commonly found in the field of data mining. Two essential adjustments were required: first, building a geometric structure from the data, and second, adapting the method to handle the types of attributes encountered in these datasets. These adjustments did not change the complexity of e-MGD, which remains linear, allowing it to be applied to datasets with larger numbers of instances and features. One distinctive characteristic of the proposed extension is that it focuses on preserving dataset information rather than on improving classification accuracy, which is the focus of most instance selection methods.

To assess the performance of the method, we compared it with a number of classical and contemporary instance selection methods using medium to large datasets, plus a further set of very large datasets. The results showed good classification accuracy relative to the other methods, indicating that e-MGD is a good alternative for instance selection.

